ANALYZING MOLECULE AND PROTEIN DIVERSITY

This application claims priority from Provisional United States Patent 
Application Serial Number 60/127,486, filed April 2, 1999, and incorporated by 
reference.

This invention relates to analyzing molecule and protein diversity. 
Combinatorial chemistry allows the creation of unprecedented numbers of 
organic compounds. The rational synthesis of millions of small organic molecules 
is now achievable in a matter of days. It is estimated that more than ten to
the hundredth power small molecules could be synthesized using current
methods. By "small", we mean a molecule having a mass of less than 1500 Daltons,
where a Dalton is defined as 1/12 of the mass of a carbon-12 atom, or roughly
the mass of a hydrogen atom.

An important question is how one can create a set of molecules of such
diversity as to contain at least one potent binder to any given target of interest.
This question is central to drug discovery in an era that is characterized by a
growing wealth of DNA sequence information and a relative dearth of
corresponding target structures and their functions. In a world in which there are
many more putative targets than can be studied by x-ray crystallography,
multidimensional NMR, or other high resolution biophysical techniques, any attempt
to generate biologically active ligands to targets of unknown structure will 
require general screening libraries: libraries of molecules that cover a high 
percentage of so-called "diversity space". 

The utility of small molecules as drugs depends in part on molecular
complementarity: how well the molecules fit and/or stick to chemically active
sites (often in the form of depressions in a protein) on the surface of a cell, on the
surface of an intracellular organelle, or on a cytosolic protein. The potential
molecular complementarity of a small molecule is in large part determined by
two factors:

1) The shape of the molecule, meaning the total Van Der Waals (VDW)
surface of a given conformation of the molecule and how it follows or does not
follow the VDW surface of the target site of interest. The shape complementarity
of molecule and target is largely responsible for energetic forces such as the
displacement of water (the "hydrophobic effect") and so-called "Van Der Waals"
or "London dispersion" forces.

2) The types of potential energetic interactions (such as hydrogen
bonding, percentage ionic bonding, proximity of polarizable moieties) found at
various places on the molecule, and the manner, order, and spatial orientation
in which the energetically interactive portions of the molecule are connected to
each other and presented in the presence of the target of interest.

Typically, each of these factors is measured or calculated in only a
relative way with respect to a specific set of molecules or with respect to specific
protein surfaces being examined. A general comparison of potential molecular 
complementarity between two sets of molecules, however, requires doing 
calculations or experiments using an absolute or fixed frame of reference. 

For example, if company 1 has tested a molecule set A for affinity to a set
of target surfaces X found in cancerous cells, and company 2 has tested a
molecule set B against a set of target surfaces Y found in nervous tissue, the
value of molecule set B with respect to the target surfaces X of company 1 is not
apparent because each of the affinity evaluations has been performed against a
different standard. Also, without a full molecular calculation of set A against
target surfaces Y, it is not apparent whether potential chemically active portions
of the target surface Y could be bonded by molecules in set A. Finally, even if all
calculations of sets A and B vs. surfaces X are performed, neither company will
gain a measure of the "absolute diversity" of its molecule set; that is, they will
have no measure of the likelihood that either set A or set B will contain a molecule
that has potential activity against any given target T. This is because the standards
against which they have measured their molecules are only a small subset of
potential target surfaces.



Implementations of the invention provide an absolute or fixed frame of
reference for comparing sets of molecules and protein surfaces. By defining
molecules in terms of their complementarity or attraction to a fully enumerated
basis set of theoretical protein surfaces within given parameters, it is possible to
measure the diversity of a set of molecules against a non-relative standard, i.e., an
absolute measurement. This allows for an efficient comparison of different sets of
molecules, for example, sets of drugs, and enables meaningful categorization of
classes of molecules against a standard set of surfaces. Furthermore, this allows
for the detection of theoretical protein surfaces to which no molecules in a set are
complementary, thus enhancing the ability of a chemist to design novel
molecules that supplement the deficiencies of the original set. By defining
real-world protein surfaces in terms of their similarity to a basis set of theoretical
protein surfaces, it is similarly possible to categorize real-world sets of protein
surfaces against a standard set of theoretical surfaces. This allows for improved
classification of proteins into similar target classes by the similarity of their
surface sites. By evaluating an actual set of molecules against theoretical protein
surfaces and further by evaluating a set of real-world protein surfaces of interest
against the same theoretical protein surfaces, it is possible to select classes of
molecules that are likely to have substantial activity against the real-world protein
surfaces. This allows for improved molecule screening, for example, in drug
research. By evaluating an actual set of molecules against theoretical protein
surfaces and further by evaluating a set of real-world protein surfaces of interest
against the same theoretical protein surfaces, it is also possible to find actual
protein surfaces against which no molecule in set A has likely activity. This
enhances the ability of a chemist to design molecules beyond those in set A that
match the previously unmatched protein surfaces of interest, and that may thereby be
tested for pharmaceutical activity, thus supplementing deficiencies in the original
set of molecules.

Thus, in general, in one aspect, the invention features a computer-based 
method in which a set of constraints on possible target surfaces is defined, and a
fully enumerated set of theoretical target surfaces under the defined constraints is
also defined, such that each surface has a defined, continuous volume and a
defined, continuous surface area. One or more sets of objects are mapped to the
fully enumerated set of theoretical target surfaces to define corresponding subsets
of the fully enumerated set of theoretical target surfaces. An aspect of diversity
of the objects is analyzed based on degrees of similarities and differences among
the corresponding subsets. 

Implementations of the invention may include one or more of the 
following features. The target surfaces may include negative space target 
surfaces. The objects may include positive space object surfaces associated with 
different molecules. The objects may be mapped by defining corresponding
subsets of the fully enumerated set of negative space theoretical target surfaces to
which positive space object surfaces of conformations of molecules are
complementary. The aspect of diversity that is analyzed may be the difference or
similarity between the molecules which map to those negative space theoretical
target surfaces.

The objects may include negative space object surfaces associated with 
different proteins, and the objects may be mapped by defining corresponding 
subsets of the fully enumerated set of negative space theoretical target surfaces to 
which negative space object surfaces of protein pockets are similar. The aspect of 
diversity that is analyzed may be the difference or similarity between protein
pockets which map to those negative space theoretical target surfaces. The
objects may include positive space object surfaces associated with different
molecules and negative space object surfaces associated with different proteins.
In the case of molecules, the objects may be mapped by defining corresponding
subsets of the fully enumerated set of negative space theoretical target surfaces to
which positive space object surfaces of conformations of molecules are
complementary. In the case of proteins, the objects may be mapped by defining
corresponding subsets of the fully enumerated set of negative space theoretical
target surfaces to which negative space object surfaces of protein pockets are
similar. The aspect of diversity that is analyzed may be the difference or
similarity of the molecules which map to those negative space theoretical target
surfaces to the protein pockets which map to those negative space theoretical
target surfaces. 

The theoretical target surfaces and the objects may be polyhedrons, e.g., 
cubes, all of the same size and shape. The set of all theoretical target surfaces 
defines a diversity space within which the diversity of objects can be measured
by mapping those objects to the diversity space. Regions of the diversity space to 
which no objects map may be identified, and molecules may be designed that 
occupy at least one of the unfilled theoretical target surfaces of the diversity 
space. 

Complementarity may be associated with binding affinities of positive
space object surfaces of conformations of molecules to negative space theoretical
target surfaces. 

The constraints may include volume and associations of each of a number of
sites of the target surface with a preselected molecular property drawn from a
larger set of possible molecular properties, including hydrophobic, polarizable,
H-bond acceptor, H-bond donor, H-bond donor/acceptor, potentially positively
charged, and potentially negatively charged. Fewer than all of the sites of the
target surface may each be associated with a different one of the molecular
properties and all of the other sites of the target surface may be associated with a
common molecular property, such as slightly hydrophobic. The degrees of
similarities or differences may involve functional properties associated with the
corresponding subsets of the fully enumerated set of theoretical target surfaces or 
shape properties associated with the corresponding subsets of the fully 
enumerated set of theoretical target surfaces. 

Each of the objects may be defined by quantizing molecules into
polyhedrons. Each of a fixed set of orientations of each conformation of each of
the objects may be fitted to each of the target surfaces, and each of the fittings 
may be scored. 

The constraints may include a resolution of the polyhedrons, e.g., 4.24 
Angstroms, or maximum and minimum numbers of polyhedrons.
Each of the polyhedrons may share a common interface with another of
the polyhedrons. The constraints may also include the absence of any occluded 
volumes greater than a given user-defined parameter. The target surfaces may be 
defined conceptually as having been carved out of a flat surface. 
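By way of illustration only, the interface-sharing constraint can be checked on a cubic lattice as sketched below. The sketch assumes that polyhedrons are unit cubes identified by integer (x, y, z) coordinates, that a "common interface" means a shared face, and that the whole set must form one contiguous piece, as the later description of negative space cubes requires. The function name `is_face_connected` is hypothetical and is not part of the described implementation.

```python
from collections import deque


def is_face_connected(cubes):
    """Return True if the cubes form one piece linked through shared faces.

    `cubes` is an iterable of integer (x, y, z) lattice coordinates
    (an assumed representation, chosen for this sketch only).
    """
    cubes = set(cubes)
    if not cubes:
        return False
    # Offsets to the six face neighbours of a cube on the lattice.
    steps = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
             (0, -1, 0), (0, 0, 1), (0, 0, -1)]
    start = next(iter(cubes))
    seen = {start}
    queue = deque([start])
    # Breadth-first search over face-sharing neighbours.
    while queue:
        x, y, z = queue.popleft()
        for dx, dy, dz in steps:
            neighbour = (x + dx, y + dy, z + dz)
            if neighbour in cubes and neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen == cubes
```

For example, `is_face_connected([(0, 0, 0), (1, 0, 0), (1, 1, 0)])` holds, while two cubes that touch only along an edge, such as `(0, 0, 0)` and `(1, 1, 0)`, do not satisfy the check.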

In general, in another aspect, the invention features categorizing existing
molecules based on negative space target surfaces to which conformations of the
molecules are complementary, and designing novel molecules that are
complementary to negative space target surfaces to which no conformations of
the existing molecules are complementary.

In general, in another aspect, the invention features a method of creating
novel molecules to be tested as ligands for proteins. In the method, proteins are
categorized based on target surfaces to which their pockets of known structure
map, and novel molecules are designed that are complementary to the negative
space target surfaces to which the protein pockets map.

In general, in another aspect, the invention features a computer
programmed to determine the chemical similarity of different molecules. The
program approximates the surface shape of each one of a plurality of molecules
of interest by linking a series of cubes, each cube having a dimension R, the
locations of the cubes being determined by the calculated electron probability
density of the individual one of the molecules of interest, each cube sharing at
least one of its six faces with another cube, such that there is a specific number of
linked cubes which varies for each individual one of the plurality of molecules of
interest. The chemical reactivity of each individual one of the plurality of
molecules of interest is approximated by assigning each cube of each individual
one of the plurality of molecules of interest no more than one functionality value
from a plurality of M different chemical functionality values. The surface shape
and chemical reactivity of a chemically active surface having a volume equal to
V are approximated by subtracting a number V/R^3 of cubes of dimension R from a
surface, wherein each of the cube spaces shares at least one face with another
cube space and wherein N of the cube spaces have one of a plurality of M different
chemical functionality values. An attraction value K is calculated for each one of
the plurality of molecules of interest to the chemically active surface. A list of
overall attraction values to the chemically active surface is calculated.

Implementations of the invention may include one or more of the 
following features. The calculation of the attraction value K may be performed on 
a plurality of different predetermined chemically active surfaces, and a matrix of
overall attraction values of each molecule of interest to each of the different
surfaces may be calculated. The molecules of interest may include organic
molecules. The chemically active surface having a plurality of predetermined
active chemical locations may be calculated to correspond to the shape of an
actual protein surface structure. The molecules of interest may be organic
molecules of 1500 Daltons or less. The chemically active surface having a
plurality of predetermined active chemical locations may be compared to an
actual protein surface to calculate a similarity value of the actual protein surface
to the predetermined active chemical locations. The predetermined chemically
active surfaces may be compared to a plurality of actual protein surfaces and a
matrix of similarity values may be calculated. The cube spaces subtracted from
the surface may be calculated to approximate the electron probability density of
at least one of a plurality of depressions in known protein surface structures. The
N sites of chemical functionality may be calculated to approximate the location
and type of chemical functionality of actual depressions in known protein
structures. 

Other advantages and features will become apparent from the following 
description and from the claims. 



DESCRIPTION

Figure 1 shows a (CH2)n chain encapsulated by 4.24 Å cubic units.

Figure 2 shows examples of surfaces allowed and disallowed by the
non-occlusion parameter in a theoretical target surface generation algorithm. Gray
shading represents the opening of the theoretical surface. A is allowed. B is
disallowed due to two occluded negative space cubes (marked X).

Figure 3 shows a theoretical target surface of 13 negative space cubes and
four sites of specific molecular property interaction: hydrophobic (white),
polarizable (purple), H-bond accepting (green), and H-bond donating (orange).
Blue shading indicates the opening of the theoretical surface.

Figure 4 shows a "quantized" representation (Q-file) of one conformation
of molecule 6a superimposed on its atomic structure (ball and stick and
space-filling model). Molecular property characteristics of the Q-file are hydrophobic
(white quanta), polarizable (purple quanta), H-bond accepting (green quanta), and
negatively charged (red quanta).

Figure 5 shows test molecules.

Figure 6 shows a ranking of molecules by QSCD similarity scores.

Figure 7 illustrates examination of a theoretical target surface common to
molecules 8c (top) and 8a (bottom). Blue shading indicates opening of the
theoretical surface. Specific points of complementarity on the theoretical target
surface are hydrophobic (white), polarizable (purple), and H-bond donating
(orange). Superimposition of the original molecular conformations onto the
theoretical target surface demonstrates that the extra phenyl substituent of 8c
protrudes from the opening of the theoretical surface and is not involved in
complementarity to the surface.

Figure 8 illustrates examination of a theoretical target surface common to
molecules 1a (A) and 5a (B). Blue shading indicates opening of the theoretical
surface. Specific points of surface complementarity found by QSCD are
hydrophobic (white), polarizable (purple), and H-bond donating/accepting
(yellow). Overlay plot (C) of the non-hydrogen backbones of 1a (orange) and 5a
(green) indicates similar features. Aromatic/hydrophobic overlaps are shown in
purple; H-bond donating oxygens are in red. Overlay generated with Sybyl
version 6.5 (Tripos Inc., 1699 S. Hanley Rd., St. Louis, MO 63144).

Figure 9 illustrates ranking of molecules in Figure 5 by Tanimoto
similarity score of 2D UNITY fingerprints.

Figure 10 shows a QSCD plot of all of the theoretical surface shapes
covered by all of the conformations of all of the molecules (blue dots) in Figure
5. The total volume of the cube encompasses all 49,268,918 theoretical surface
shapes as listed in Table 1. Red dots show two exemplary theoretical surface
shapes (a, b) not covered by any of the molecules in Fig. 5. Axes used are
functions of opening area, opening length/width, and depth per opening quantum.

Figure 11 shows a map of the 20 compounds in Fig. 5 (blue dots) in a
representative BCUT three-axis diversity space. BCUT axes used are, 
respectively: 1) BCUT HACCEPT S INVDIST 050 R H, 2) BCUT HDONOR S 
INVDIST 030 R H, and 3) BCUT TAB POLAR S INVDIST 300 R L. Red dot 
shows an unfilled coordinate of diversity space, at (7.54, 7.25, 6.82). The 
information contained in this BCUT coordinate does not reveal information about 
the shape of a molecule which might be able to fill this position in diversity 
space. 

Figure 12 shows use of QSCD to design complementary combinatorial 
libraries to unmatched theoretical target surfaces. Many conceivable libraries of a 
given shape and functionality may be designed to fill a given unmet diversity 
need. 

Figure 13 shows two sample surfaces. 
Figure 14 illustrates the quantization process. 

Figure 15 shows a legend for symbols used in functionality rule diagrams. 

Figure 16 shows functionality rules for Potential Negative Charge 
Functionality. The structures are searched for in order. 

Figure 17 shows functionality rules for Potential Positive Charge 
Functionality. The structures are searched for in order. 

Figure 18 shows a functionality rule for Hydrogen Bond Donor/Acceptor
Functionality. 

Figure 19 shows a functionality rule for Hydrogen Bond Donor 
Functionality. 

Figure 20 shows a functionality rule for Hydrogen Bond Acceptor 
Functionality. The structures are searched for in order. 

Figure 21 shows a functionality rule for Polarizable Functionality.

Figure 22 shows Table 5, a ranking of molecules in Fig. 5 by QSCD
diversity score. Blue = homogeneous pairs, yellow = +phenyl pairs (8c), green =
AT1-AT2 pairs (3,4).

Figure 23 shows Table 6, a ranking of molecules in Fig. 5 by Tanimoto
similarity score of 2D UNITY fingerprints. Blue = homogeneous pairs, yellow =
+phenyl pairs (8c), green = AT1-AT2 pairs (3,4).

Figure 24 shows an example subset of theoretical surfaces Ti containing 4 
members and an example central set Ci for Ti (F = 1, E = 3) where the black face 
denotes a point of attachment Al on Ci:

Figure 25 shows an example Core Molecule Mi to fill Central set Ci.

Figure 26 shows an example Library (Mi,B) where B = a set of amines. 

Figure 27 shows an example subset of target surfaces Ti containing 4 
members and an Example Central set Ci for Ti (F = 1, E = 3), where the black
face denotes a point of attachment Al on Ci:

Figure 28 shows an Example Core Molecule Mi to fill Central set Ci.

Figure 29 shows an example Library L(Mi, B) where B = a set of amines. 

Figure 30 shows a protein quantization process. 



We define diversity as the measure, based on pre-defined criteria, of the
difference or similarity among all members of a set. In a pharmaceutical setting,
molecular diversity can be defined as the measure, based on biological criteria, of
the difference or similarity between small molecules. Each of the existing
methods of calculating biologically relevant diversity of small molecules defines
slightly different criteria for molecular comparison, and thus a different
configuration of diversity space as a whole. Examples include low dimensional
diversity space such as BCUT metrics, high dimensional diversity space such as 
Chem-X/ChemDiverse multiple point pharmacophores, and empirical biological 
diversity space such as affinity fingerprinting. 

WO 00/60507 (PCT/US00/08777)

Many known ways of quantifying the diversity of molecules use
molecular properties such as functionality and connectivity as a basis for
categorization (see for instance Potter and Matter, J. Med. Chem., 1998, p. 478).
For example, in the BCUT method used to generate four- to six-dimensional 
diversity space, molecules are broken down into matrices according to 
connectivity and molecular interaction properties. Coordinates in diversity space 
are assigned through the resulting eigenvalues of these matrices, leading to useful
multi-dimensional plots of molecular diversity. However, because the use of 
eigenvalues is an irreversible transformation (different 3D shapes can map to the 
same eigenvalues), it follows that an empty coordinate in BCUT diversity space 
cannot be translated into a 3D template of a "missing molecule." Thus, while a 
model such as BCUT diversity is well validated as a tool for finding
combinatorial matches to a lead compound or pharmacophore, it cannot be
directly used to populate the entire diversity space that it defines. 
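The irreversibility of an eigenvalue mapping can be seen in a toy case. The sketch below is not a BCUT calculation; it simply assumes two hand-picked symmetric 2x2 matrices and a hypothetical helper, `symmetric_2x2_eigenvalues`, and shows that distinct matrices can yield identical eigenvalues, so the eigenvalues alone cannot be inverted to recover the matrix that produced them.

```python
import math


def symmetric_2x2_eigenvalues(a, b, d):
    """Eigenvalues of the symmetric matrix [[a, b], [b, d]], smallest first."""
    mean = (a + d) / 2.0
    radius = math.sqrt(((a - d) / 2.0) ** 2 + b ** 2)
    return (mean - radius, mean + radius)


# Two different matrices, given as (a, b, d); values chosen for illustration.
matrix_one = (2.0, 0.0, 0.0)   # [[2, 0], [0, 0]]
matrix_two = (1.0, 1.0, 1.0)   # [[1, 1], [1, 1]]

eig_one = symmetric_2x2_eigenvalues(*matrix_one)
eig_two = symmetric_2x2_eigenvalues(*matrix_two)

# Same coordinates in "eigenvalue space", different underlying matrices.
same_coordinates = (eig_one == eig_two)
```

Both matrices map to the eigenvalue pair (0, 2), so an empty or occupied coordinate in such a space does not identify a unique underlying structure.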

Similarly, in the popular Chem-X/ChemDiverse diversity package, 
molecules are broken down into all accessible three- or four-point 
pharmacophores of triangular or tetrahedral functionality distances. If the model
is used to display molecular diversity, coordinates in diversity space are assigned
through the resulting string of accessible three- or four-point pharmacophores;
this method has been shown to be highly effective in classifying molecules by
pharmacological similarity. However, the mapping of complex 3D shapes to a set
of triangular or tetrahedral functionality distances is an irreversible
transformation; empty three- or four-point pharmacophores in Chem-X derived
diversity space cannot be translated into a 3D template of a complex shape. Since
a set of coordinates in Chem-X is insufficient to define the shape of "missing
molecules," Chem-X cannot be used to directly populate empty molecular
diversity space.

Another example of current diversity methods is affinity fingerprinting, in 
which molecules are empirically assayed against a panel of 10-20 actual proteins 
selected to be promiscuous in their ability to bind small molecules. Position in 
molecular diversity space is assigned through the resulting string of IC50 binding 
values, and these affinity fingerprints provide unprecedented ability to group
similarly active compounds in diversity space. However, because the actual mode
of binding in any assay is not incorporated in the resulting IC50 value, the
mapping of molecules to the selected protein panel is an irreversible
transformation. Thus, an empty coordinate in affinity fingerprinting diversity
space (an "unmatched" string of IC50s to a given protein panel) cannot be
back-translated into a 3D molecular template. A similar affinity fingerprinting diversity
method has been put into practice using a panel of computational surfaces of real- 
world protein pockets and a modified form of the DOCK program. While this 
method shows similar promise in its ability to detect pharmacological similarity, 
it is, like its empirical affinity fingerprinting counterpart, an irreversible mapping. 

Thus, for the most part, current methods are able to successfully identify 
compounds of the same pharmacological class as being similar and compounds of 
different pharmacological classes as being different. Given a starting 
pharmacophore from known ligands and/or the target site of a target crystal 
structure, such methods interface well with the design of complementary 
combinatorial libraries. 

The design of combinatorial libraries to cover all of diversity space is a 
rather different problem, however. In this case, it is not enough to be able to 
compare existing molecules for differences or similarities. In addition to being 
able to place molecules relative to one another in diversity space, one must be 
able to point to an absolute area of diversity space not yet covered and from its 
coordinates design a novel set of compounds to fill that uncovered space. 

In order to rationally and systematically fill diversity space, an 
informationally reversible diversity model is needed. This model must be 
formulated such that: members (in this case molecules) can be assigned to 
coordinates for similarity/dissimilarity comparison, and empty coordinates retain 
the information necessary to directly generate coordinate membership. 

One good path for such a model is to use as coordinates the exact 
information that differentiates one member from another, without intervening, 
irreversible transformations. To apply this reasoning to molecular diversity, it 
must first be asked: what are the criteria by which diversity of compounds is to be
measured (that is, what information differentiates one molecule from another)? One of
the most fundamental criteria in molecular drug discovery is the extent to which
two molecules have similar or different binding affinities to a given target. With
the assumption that similar binding affinity tracks with a molecule's
complementarity to similar target surfaces, we have selected as our criterion for
diversity complementarity to a fully enumerated set of theoretical target surfaces.

Given the above definition of molecular diversity, it remains to provide 
parameters under which to define a biologically relevant basis set of enumerated 
theoretical target surfaces and to quantify molecular complementarity to a given 
theoretical target surface at a level which is both in accordance with known
principles of molecular recognition and computationally applicable to millions of
compounds. With a numerical determination of complementarity and a 
biologically relevant basis set of surfaces, molecular diversity space is thus 
absolutely established as the molecular complement to a fully enumerated set of 
theoretical target surfaces. 

We introduce the concept of quantized surface complementarity diversity
(QSCD), which defines a molecule numerically by a mapping that describes its
complementarity K to every distinct theoretical protein surface of resolution R
not exceeding volume V with N sites of M types of chemical functionality Pmn. K
is computed by an algorithm that takes into account the molecular shape and
chemical functionality of both the given molecule and the given theoretical
protein surface. From this definition, it follows that a comparison of two
molecules will yield a numerical difference that is representative of their
complementarities: to the extent that two molecules each have complementarity
for the same theoretical protein surfaces, the molecules are similar; to the extent
that two molecules have no complementarity to common theoretical protein
surfaces, the molecules are dissimilar. In other words, "similarity" between 
molecules is defined as the ability to complement the same theoretical protein 
surfaces and "difference" between molecules is defined as the ability to 
complement different theoretical protein surfaces. 
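The definition above can be sketched in a few lines of Python. The sketch assumes that each molecule has already been reduced to a table of complementarity scores K keyed by theoretical surface identifier, that a cutoff on K decides whether a surface counts as complemented, and that similarity is reported as the shared surfaces divided by all surfaces complemented by either molecule. The cutoff, the overlap ratio, the function names, and the K values are all illustrative assumptions and are not taken from this description.

```python
def complemented_surfaces(k_by_surface, threshold):
    """Surface ids whose score K meets the (hypothetical) cutoff."""
    return {sid for sid, k in k_by_surface.items() if k >= threshold}


def overlap_similarity(surfaces_a, surfaces_b):
    """Shared surfaces divided by all surfaces complemented by either molecule."""
    union = surfaces_a | surfaces_b
    if not union:
        return 0.0
    return len(surfaces_a & surfaces_b) / len(union)


# Made-up K values for three illustrative theoretical surfaces.
mol_a = complemented_surfaces({"s1": 0.9, "s2": 0.8, "s3": 0.1}, threshold=0.5)
mol_b = complemented_surfaces({"s1": 0.7, "s2": 0.2, "s3": 0.6}, threshold=0.5)

# The molecules share surface s1 out of three surfaces complemented in total.
similarity = overlap_similarity(mol_a, mol_b)
```

Two molecules that complement exactly the same surfaces score 1 under this ratio, and two molecules with no surface in common score 0, matching the qualitative definition of similarity and difference given above.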

Because QSCD uses complementarity to theoretical protein surfaces as a
basis for categorization, both 3-D shape and molecular functionality are taken
into account. Also, because QSCD uses as its basis a complete set of theoretical
protein surfaces (under R, V, and Pmn), the method provides diversity
information in a fixed frame of reference: coordinates in quantized surface
complementarity diversity space are independent of any molecules or natural
protein surfaces compared in that space. Because of this, not only can the
diversity of disparate sets of molecules be compared without having to compare
the molecules to each other directly, but the diversity of any set of actual natural
protein surfaces can be examined through complementarity to the theoretical
basis set. In addition, complementarity of molecules to a set of theoretical protein
surfaces representative of actual natural protein surfaces can be examined in the
context both of any other set of molecules and any other set of protein surfaces.
Furthermore, given a set of molecules, it becomes immediately apparent what
percentage of theoretical protein surfaces are covered by complementary
molecules, thus giving a measure of the set's molecular diversity in the space
defined by all potential surfaces under R, V, and Pmn. Not only does this provide
a measure of diversity in an absolute sense which is not relative to any 
historically biased set of surfaces or molecules, but it also makes clear a set of 
theoretical target surfaces to which no molecules in the initial set bind, allowing a 
straightforward design of novel molecules to supplement the initial set. 
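A minimal sketch of this coverage measure follows. It assumes that the basis set is available as a collection of surface identifiers and that each molecule has already been mapped to the set of surfaces it complements; the function name `coverage_report` and the example data are hypothetical.

```python
def coverage_report(all_surface_ids, surfaces_by_molecule):
    """Return (fraction of basis surfaces covered, set of unmatched surfaces).

    `surfaces_by_molecule` maps a molecule name to the surface ids it
    complements (an assumed, precomputed mapping).
    """
    all_ids = set(all_surface_ids)
    covered = set()
    for surfaces in surfaces_by_molecule.values():
        # Ignore any id that is not part of the basis set.
        covered |= set(surfaces) & all_ids
    unmatched = all_ids - covered
    fraction = len(covered) / len(all_ids) if all_ids else 0.0
    return fraction, unmatched


# Five illustrative basis surfaces and two illustrative molecules.
fraction, unmatched = coverage_report(
    range(5),
    {"molecule_1": {0, 1}, "molecule_2": {1, 3}},
)
```

Here three of the five surfaces are covered, and surfaces 2 and 4 remain as the unmatched targets for which novel molecules would be designed.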

Theoretical Target Surfaces

In one implementation, to generate a finite set of theoretical target 
surfaces that approximates all possible binding pockets with volume equal to or 
less than V, we consider each theoretical surface to be formed by successively 
carving cubic units out of an initially flat surface. These cubic units represent
"negative space" that a potential ligand could occupy. Given cubic units with
sides of length R (the resolution of the model), we use at most V/R^3 negative
space cubes to describe each theoretical target surface. Others have previously 
employed cubic units to successfully approximate complementarity between 
small molecules and individual protein surfaces. 
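As a worked illustration of the V/R^3 bound, the sketch below computes the largest whole number of negative space cubes for a given volume. The function name is hypothetical; the inputs of 4.24 Angstroms and 1070 cubic Angstroms are the resolution and upper volume quoted for one implementation later in this description.

```python
import math


def max_negative_space_cubes(volume, resolution):
    """Largest whole number of cubes of side `resolution` within `volume`."""
    return math.floor(volume / resolution ** 3)


# Volume of a single cube at the 4.24 Angstrom resolution.
cube_volume = 4.24 ** 3

# Upper bound on cubes for a 1070 cubic Angstrom pocket.
max_cubes = max_negative_space_cubes(1070.0, 4.24)
```

A single cube occupies about 76.2 cubic Angstroms, so six cubes occupy about 457 cubic Angstroms and the bound for 1070 cubic Angstroms is 14 cubes, in line with the range of 6 to 14 cubes and roughly 460 to 1070 cubic Angstroms stated for that implementation.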

The size of a negative space cube is directly related to the resolution and
type of diversity data which the user desires as output. In choosing the size of the
negative space cube, one motivation is to maximize negative space cube size such
that the difference of a single cube in a surface is highly differentiating in terms
of molecular recognition (i.e., every surface is orthogonal to every other surface).
At the same time, enough information must be retained in each negative space
cube to predict shape and functional complementarity at a ligand/surface
interface. The former constraint minimizes overlap of diversity information while
the latter constraint maximizes precision of diversity information. Together, the
competing constraints result in a basis unit for the enumeration of theoretical
target surfaces that minimizes the number of negative space cubes needed to
accurately model diversity for a given volume V.

A negative space cube resolution of 4.24 Å was found by computer
optimization of test molecules to provide an upper limit of cube size while still
maintaining an acceptable level of molecular shape information. Interestingly,
4.24 Å is the approximate VDW "cross-section" of a (CH2)n chain; a series of
4.24 Å units neatly encapsulates a (CH2)n chain in its ground state conformation,
as shown in Figure 1.

In one implementation the basis set for diversity can be a set of theoretical
target surfaces comprising all possible shape combinations of 6 to 14 negative
space cubes of resolution 4.24 Å (negative volume between 460 and 1070 cubic
Å) subject to the following rules: Surfaces are created by successively "carving
out" negative space cubes from a flat block of infinite width and depth (the
theoretical target). All negative space cubes of a given surface must share at least
one face with another negative space cube of the surface, and all must be part of a
single, contiguous negative surface. No negative space cubes may be occluded in
the +Z axis of the infinite surface block; that is, there may be no solid surface
between any negative space cube and the surface plane of the infinite block. As
shown in Figure 2, surface A is allowed, but surface B is disallowed.
Surfaces duplicating a previous surface with respect to rotation in the X-Y plane
are discarded.
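
The rules above can be illustrated with a small enumeration. The following Python sketch is not the algorithm of Appendix A; it assumes one reading of the rules, namely that the occlusion rule turns every surface into a set of face-adjacent columns carved straight down from the surface plane, and that duplicates are removed under translation and the four rotations in the X-Y plane (mirror images are kept). The function names are hypothetical.

```python
def canonical(columns):
    """Return one representative of a shape under translation and X-Y rotation.

    `columns` maps (x, y) -> depth, the number of negative space cubes carved
    straight down at that grid position (assumed height-map reading of the
    occlusion rule).
    """
    items = list(columns.items())
    forms = []
    for _ in range(4):
        # Rotate every column by 90 degrees about the Z axis.
        items = [((-y, x), depth) for (x, y), depth in items]
        min_x = min(x for (x, _), _ in items)
        min_y = min(y for (_, y), _ in items)
        forms.append(tuple(sorted(((x - min_x, y - min_y), depth)
                                  for (x, y), depth in items)))
    return min(forms)


def grow(shape):
    """Yield every shape obtained by carving one more cube out of `shape`."""
    columns = dict(shape)
    for (x, y), depth in shape:
        deeper = dict(columns)
        deeper[(x, y)] = depth + 1           # carve below an existing column
        yield canonical(deeper)
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            neighbour = (x + dx, y + dy)
            if neighbour not in columns:      # open a new column next to it
                wider = dict(columns)
                wider[neighbour] = 1
                yield canonical(wider)


def count_shapes(max_cubes):
    """Count unique surface shapes for 1..max_cubes negative space cubes."""
    level = {canonical({(0, 0): 1})}
    counts = {1: len(level)}
    for n in range(2, max_cubes + 1):
        level = {grown for shape in level for grown in grow(shape)}
        counts[n] = len(level)
    return counts
```

Under these assumptions `count_shapes(6)[6]` evaluates to 212, the number of unique shapes listed for six cubes in Table 1. The set-based growth shown here is only practical for small N; the larger entries of Table 1 would need a more economical implementation.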

The occlusion rule provides a compromise between complete coverage of
topological possibilities and acceptable computational speed. This compromise
was made based on the topological assumption that occlusions of 4.24 Å or more
are infrequent in small molecule/target interactions, and that their omission would
thus have only a small effect on predicting diversity of binding affinities of small
molecules.

Applying the rules yields 49,268,918 unique negative surface shapes
including chiral opposites. Covering a negative volume between 460 and 1070
cubic Å, these surface shapes are deemed sufficient to examine diversity of most
small molecules. For instance, examining a previously published reference set of
pharmaceutically relevant compounds (a filtered Comprehensive Medicinal
Chemistry or CMC database), 5049 out of 5120 compounds (98.6%) have a
volume of 1070 cubic Å or less.
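
The volume bounds and the coverage figure quoted above follow from simple arithmetic, sketched below in Python. The helper name is hypothetical; the cube edge and the compound counts are the values given in the text.

```python
CUBE_EDGE_ANGSTROM = 4.24  # resolution R quoted in the text


def negative_volume(n_cubes, edge=CUBE_EDGE_ANGSTROM):
    """Volume in cubic angstroms carved out by n_cubes cubes of the given edge."""
    return n_cubes * edge ** 3


smallest = negative_volume(6)    # lower bound of the basis set
largest = negative_volume(14)    # upper bound of the basis set
cmc_fraction = 5049 / 5120       # filtered CMC compounds within the upper bound

print(f"{smallest:.0f} to {largest:.0f} cubic angstroms")
print(f"{cmc_fraction:.1%} of the filtered CMC set")
```

The two volumes come out at roughly 457 and 1067 cubic Å, which the text rounds to 460 and 1070.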

Within each of the 49,268,918 unique negative surface shapes, each
negative space cube is assigned a molecular property characteristic Pm that
represents the dominant molecular environment which any atoms that are placed
within that negative space will experience. Properties used are P1 hydrophobic,
P2 polarizable (includes aromatics), P3 H-bond acceptor, P4 H-bond donor, P5
H-bond donor/acceptor, P6 potentially positively charged (basic), and P7
potentially negatively charged (acidic). These seven types of molecular
environments are assumed to represent a minimal basis set of factors that
contributes to the electrostatic/VDW complementarity of a ligand and a target
surface. In one implementation, four positions of particular molecular property
P1-7 are assigned, leading to 7^4 * N!/((N-4)!*4!) surfaces for each surface shape
of N negative space cubes. All other (N-4) cubes not assigned a particular
molecular property are given property P8, slightly hydrophobic. The latter
assignment is based on an assumption that hydrophobic effects are, on average,
the largest single component contributing to ligand/target interaction.

In sum, the above process implies as a basis set for molecular diversity 1.1 × 10^14
theoretical target surfaces of negative volume between 460 and 1070 cubic Å and
having four sites of specific molecular property characteristics P1-7. The
numerical breakdown of these 110 trillion surfaces is listed in Table 1.

Table 1: Numerical breakdown of the total number of theoretical target surfaces
created using the algorithm given in the text. Surfaces consist of 6-14 negative
space cubes and 4 sites of 7 possible molecular property characteristics. The number
of functionally different surfaces per surface shape varies for infrequent cases in
which a given shape has an axis of symmetry, so the actual number of unique
surfaces is slightly less than (# surface shapes) * 7^4 * N!/((N-4)!*4!).

Volume (number N     Number of unique    Approx. number of functionally    Exact number of
of negative space    surface shapes      different surfaces per unique     unique surfaces
cubes)                                   surface shape:
                                         7^4 * N!/((N-4)!*4!)

6                    212                 36,015                            7,163,338
7                    885                 84,035                            73,271,443
8                    3,959               168,070                           655,324,488
9                    17,747              302,526                           5,350,917,208
10                   81,407              504,210                           40,912,578,322
11                   375,897             792,330                           297,622,676,624
12                   1,753,218           1,188,495                         2,082,225,979,379
13                   8,224,443           1,716,715                         14,116,888,070,845
14                   38,811,150          2,403,401                         93,264,917,290,356
Total 6-14           49,268,918                                            109,808,653,272,003
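
The third column of Table 1 and the overall size of the basis set can be checked with a few lines of Python. This is only an arithmetic sketch: the shape counts are copied from Table 1, the names are hypothetical, and the product computed here is the upper bound that ignores the symmetry correction mentioned in the caption.

```python
from math import comb

N_PROPERTIES = 7   # P1..P7
N_SITES = 4        # positions given a specific property

# Unique surface shapes per number of cubes N, copied from Table 1.
SHAPES = {6: 212, 7: 885, 8: 3_959, 9: 17_747, 10: 81_407,
          11: 375_897, 12: 1_753_218, 13: 8_224_443, 14: 38_811_150}


def surfaces_per_shape(n_cubes):
    """7^4 * N!/((N-4)!*4!): property choices times site placements."""
    return N_PROPERTIES ** N_SITES * comb(n_cubes, N_SITES)


# Upper bound on the basis set size, before removing symmetric duplicates.
upper_bound = sum(count * surfaces_per_shape(n) for n, count in SHAPES.items())
print(f"{upper_bound:.3e}")
```

The bound lies slightly above the exact total in the last column, consistent with the caption's remark about shapes that have an axis of symmetry.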



One such surface is shown in Fig. 3.

A pseudocode description of an algorithm for determining the set of
theoretical target surfaces is set forth in Appendix A.
Molecular Quantization

To measure complementarity of small molecules to the basis set of
theoretical target surfaces, the small molecules must be formatted in a similar
frame of reference, for instance by quantizing them into positive space cubes
("quanta") of resolution 4.24 Å according to the following process (illustrated in
Fig. 4):

A set of up to 100 minimized energy conformations within user-defined
parameters is created. In one implementation, Tripos Multisearch modeling is
used, and all conformations within 10 kcal of the lowest energy conformation
found are accepted.

For each conformation, a 4.24 Å 3D grid of cubes (quanta) is aligned on
top of the 3D structure using the molecule's principal axes of rotation (calculated
with all atoms having mass 1).

To all 4.24 Å quanta which contain at least a user-defined % of the VDW
radius of any atom, a dominant molecular property characteristic is assigned
based on connectivity rules (e.g. R-[C=O]-O-H yields P7, R-O-H yields P5; see
definitions of P1 - P7 above). Order of dominance is from P7 to P1, in order of
maximum complementarity score obtainable by a given characteristic as shown
in Table 2:

Table 2: Relative magnitudes of parameters used in calculating molecular
property interactions between negative space cubes (theoretical target surfaces)
and positive space cubes (quantized molecules). Magnitudes (listed from highest
to lowest): +++, ++, +, 0, -, --.

                      Quantized Molecule Properties
Theoretical Target    (P7)    (P6)    (P5)         (P4)       (P3)          (P2)          (P1)
Surface Properties    neg     pos     hb don/acc   hb donor   hb acceptor   polarizable   hydrophobic

(P7) neg charged              +++     0            +
(P6) pos charged      +++             0                       +             0
(P5) hb don/acc       0       0       ++           +
(P4) hb donor         +               +                       ++
(P3) hb acceptor              +       +            ++
(P2) polarizable              0                                             ++            0
(P1) hydrophobic                                                            0             +
(P8) (surface only)                   0                                     0             0

The minimum % of VDW radius parameter allows for a user-defined
protrusion beyond the surface of a quantum cube, adding a measure of
topological "flexibility" to the quantization process. A user-defined value of 32%
was found to be especially good.

The total number of 4.24 Å quanta that have been assigned a property
characteristic is counted.

The grid alignment is shifted per user-defined parameters and the process
is repeated until all shift combinations have been searched.

For each conformation in the original set, a "Q-file" (3D configuration of
property-assigned quanta) is saved that has the lowest number of quanta and is
closest to the principal alignment.

Thus, an average molecule in this implementation is represented by 100
Q-files, each file consisting of N positive space cubes or quanta of 4.24 Å
resolution having an assigned molecular property characteristic Pm (m = 1-7). A
typical Q-file (molecule 6a) is shown in Fig. 4, superimposed upon its
corresponding conformation. The process of optimization of quantization
parameters is described later.

A pseudocode description of an algorithm for performing the quantization
is set forth in Appendix B.
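
The quantization steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the algorithm of Appendix B: atoms are assumed to be already expressed in the principal-axis frame, each atom is assumed to carry a pre-assigned property index 1-7 (the connectivity rules are not modeled), a quantum is taken to "contain" an atom when the atom's sphere reaches at least 32% of its VDW radius into the cube, and ties between grid shifts are broken by the smallest total shift. All names are hypothetical.

```python
import itertools
import math

R = 4.24             # quantum edge length in angstroms
MIN_FRACTION = 0.32  # minimum fraction of the VDW radius inside a quantum


def occupied_quanta(atoms, shift=(0.0, 0.0, 0.0)):
    """Map atoms onto grid cells; return {cell index: dominant property}.

    atoms: iterable of (x, y, z, vdw_radius, prop) with prop in 1..7.
    """
    quanta = {}
    for x, y, z, radius, prop in atoms:
        centre = (x, y, z)
        reach = (1.0 - MIN_FRACTION) * radius
        ranges = [range(math.floor((c - radius - s) / R),
                        math.floor((c + radius - s) / R) + 1)
                  for c, s in zip(centre, shift)]
        for cell in itertools.product(*ranges):
            # Distance from the atom centre to the nearest point of the cell.
            gaps = []
            for c, s, i in zip(centre, shift, cell):
                low = s + i * R
                gaps.append(max(low - c, 0.0, c - (low + R)))
            if math.hypot(*gaps) <= reach:
                # Higher-numbered properties dominate (P7 first, P1 last).
                quanta[cell] = max(prop, quanta.get(cell, 0))
    return quanta


def best_alignment(atoms, steps=4):
    """Try grid shifts along each axis and keep the one with the fewest quanta."""
    offsets = [k * R / steps for k in range(steps)]
    best = None
    for shift in itertools.product(offsets, repeat=3):
        quanta = occupied_quanta(atoms, shift)
        key = (len(quanta), sum(shift))  # ties go to the smallest total shift
        if best is None or key < best[0]:
            best = (key, shift, quanta)
    return best[1], best[2]
```

A single atom sitting exactly on a grid corner touches eight quanta at zero shift, while a half-cell shift along each axis encloses it in one quantum; `best_alignment` selects the latter.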
Mapping 

Given molecules which have been rendered into sets of Q-files, each
quantized conformation can be mapped into the diversity space defined by the set
of 1.1 × 10^14 theoretical target surfaces. In general, the following process is used.

For each quantized conformation of each molecule, each of its 24 possible
X/Y/Z rotations (6 faces * 4 rotations per face) is fit to each of the 49,268,918
available surface shapes. For a given conformation-to-surface shape fit, if at least
a user-defined minimum number of negative and positive space cubes overlap (in
one implementation either 9 quanta or N-2 quanta of a conformation of N
quanta), and if no quanta of the conformation extend beyond the bounds of the
surface shape except at the mouth of the surface shape, then the complementarity
of the quantized conformation to all theoretical target surfaces of that shape is
examined in detail as explained next. If the above conditions are not met, the next
conformation is examined.
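
The shape-fitting test described above can be sketched in Python. The sketch is not the algorithm of Appendix C and rests on several assumptions: the surface plane is placed at z = 0 with negative space cubes at z < 0, "at the mouth" is read as "anywhere at z >= 0", the overlap requirement is read as the smaller of 9 and N-2 (with a floor of 1), and translations are tried by aligning each quantum with each negative space cube. All names are hypothetical.

```python
from itertools import permutations, product


def determinant(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))


def cube_rotations():
    """Return the 24 proper rotations of a cube as 3x3 integer matrices."""
    rotations = []
    for perm in permutations(range(3)):
        for signs in product((1, -1), repeat=3):
            matrix = [[signs[r] if c == perm[r] else 0 for c in range(3)]
                      for r in range(3)]
            if determinant(matrix) == 1:  # drop mirror images
                rotations.append(matrix)
    return rotations


def rotate(quanta, matrix):
    """Apply a rotation matrix to every (x, y, z) cell in a set."""
    return {tuple(sum(matrix[r][c] * q[c] for c in range(3)) for r in range(3))
            for q in quanta}


def fits(quanta, cavity):
    """True if some rotation and translation satisfies the overlap rule.

    quanta: set of (x, y, z) cells of a quantized conformation.
    cavity: set of (x, y, z) negative space cubes, all with z < 0; every
            other cell with z < 0 is treated as solid.
    """
    needed = max(1, min(9, len(quanta) - 2))
    for matrix in cube_rotations():
        turned = rotate(quanta, matrix)
        for anchor, target in product(turned, cavity):
            offset = tuple(t - a for a, t in zip(anchor, target))
            placed = {tuple(q[i] + offset[i] for i in range(3)) for q in turned}
            inside = placed & cavity
            outside = placed - cavity
            # Protruding quanta may only sit above the surface plane.
            if len(inside) >= needed and all(z >= 0 for _, _, z in outside):
                return True
    return False
```

For example, a straight chain of four quanta fits a well three cubes deep with one quantum protruding at the mouth, whereas a flat 2 x 2 square does not fit the same well because any overlapping placement pushes a quantum into solid surface.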

A score is generated for the complementarity of the given conformation to
each theoretical target surface of a given shape based on user-defined
parameters. (The process of optimization of the complementarity parameters is
described later.) The following complementarity parameters can be used:

a) A negative parameter for each rotatable bond of the conformation.

b) If conformational energies are calculated, a negative parameter for the
energy of the conformation above the lowest energy conformation
from that molecule.

c) A positive parameter for the hydrophobic energy gained by removing
"water" from any hydrophobic (P1) or polarizable (P2) surface face of
either the conformation or the theoretical surface.

d) A positive parameter for the hydrophobic energy gained by removing
"water" from any mildly hydrophobic (P8) surface face of the
theoretical surface.

e) A positive or negative molecular property interaction parameter for
overlapping negative and positive space cubes as depicted in Table 2.

If and only if the resulting score meets a user-defined minimum, then the
conformation (and thus the molecule it represents) is said to be complementary to
the given theoretical target surface.
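
How parameters (a) through (e) might combine into a single score is sketched below in Python. The text gives only relative magnitudes, so every numeric weight here is a hypothetical placeholder, the additive form is an assumption, and the interaction table holds only a handful of entries from Table 2 (pairs not listed default to zero purely for the sake of the example).

```python
# Hypothetical numeric values for the symbolic magnitudes of Table 2.
MAGNITUDE = {"+++": 3.0, "++": 2.0, "+": 1.0, "0": 0.0, "-": -1.0, "--": -2.0}

# A few (surface property, molecule property) entries taken from Table 2;
# a complete table would cover every pair.
INTERACTION = {(7, 6): "+++", (6, 7): "+++", (5, 5): "++",
               (4, 3): "++", (3, 4): "++", (2, 2): "++", (1, 1): "+"}

# Placeholder weights for parameters (a) to (d); the signs follow the text.
DEFAULT_WEIGHTS = {"bond": -0.5, "energy": -0.1, "hydrophobic": 0.3, "mild": 0.1}


def complementarity_score(pairs, rotatable_bonds, relative_energy,
                          hydrophobic_faces, mild_hydrophobic_faces,
                          weights=None):
    """Sum the five kinds of contribution for one conformation/surface match.

    pairs: iterable of (surface_property, molecule_property) for each
           overlapping negative/positive space cube pair.
    """
    w = dict(DEFAULT_WEIGHTS)
    w.update(weights or {})
    score = w["bond"] * rotatable_bonds              # parameter (a)
    score += w["energy"] * relative_energy           # parameter (b)
    score += w["hydrophobic"] * hydrophobic_faces    # parameter (c)
    score += w["mild"] * mild_hydrophobic_faces      # parameter (d)
    for surface_prop, molecule_prop in pairs:        # parameter (e)
        symbol = INTERACTION.get((surface_prop, molecule_prop), "0")
        score += MAGNITUDE[symbol]
    return score


def is_complementary(score, minimum):
    """Apply the user-defined threshold to a score."""
    return score >= minimum
```

With these placeholder weights, a single charge-pair overlap on a conformation with two rotatable bonds scores 3.0 - 1.0 = 2.0, which passes a threshold of 2.0 and fails any higher one.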

The computational advantage inherent in the process of molecule and
surface quantization is realized in the speed of complementarity checking.
Whereas a traditional docking program must search a high-dimensional
configuration space, the implementations of the invention resolve the problem to
a framework bounded by 24 possible fitting orientations and a finite number of
translations. This approximation allows three-dimensional diversity computation
on a scale that is applicable to very large sets of molecules.

A pseudocode description of an algorithm for performing the mapping is
set forth in Appendix C.

The above process results in a complementarity map that consists of a list
of all theoretical target surfaces to which at least one conformation of a molecule
is complementary. Comparison of these maps provides a novel method for
measuring diversity of small molecules. We term the model on which this process
is based quantized surface complementarity diversity (QSCD) because it
calculates diversity by measuring complementarity to a quantized representation
of theoretical target surfaces.

To maintain a computationally efficient complementarity scoring system,
QSCD makes many approximations of molecular recognition. As explained,
these include cubic units of 4.24 Å resolution, gross approximations of surface
contact area, exactly 4 points of 7 finite types of molecular property
characteristics, static theoretical surfaces, and a limited set (up to 100) of low
energy conformers. Thus, the final complementarity scores are not presumed to
give precise binding energies for any individual match of conformation to target
surface. However, taken over all conformations of a molecule and across an
enumerated set of theoretical target surfaces, the scoring system is statistically
relevant as explained below.
Model Validation 

To test the validity of the QSCD model, i.e., its ability to predict the
extent to which two molecules have similar or different binding affinities, eight
sets of test molecules were analyzed (Fig. 5), seven of which were known to have
binding affinities to seven distinct targets (in addition to a known overlap
between sets 3 and 4). An eighth set with no known binding affinities was chosen
with minor atomic and spatial changes to examine the sensitivity of the QSCD
model at 4.24 Å resolution. Known activities of the molecules in Fig. 5 are listed
in Table 3 with references.

Table 3: Pharmacological activities of the molecules used in this study (see Fig.
5). a) Doherty et al. J. Med. Chem. 1995, 38, 1259-1263. b) Uehling et al. J.
Med. Chem. 1995, 38, 1106-1118. c) Chang et al. J. Med. Chem. 1994, 37,
4464-4478. d) Chang et al. J. Med. Chem. 1993, 36, 2558-2568. e) Tsutsumi et
al. J. Med. Chem. 1994, 37, 3492-3502. f) Penning et al. J. Med. Chem. 1995,
38, 858-868. g) Cristalli et al. J. Med. Chem. 1995, 38, 1462-1472. h: numbers
in parentheses indicate IC50 in AT1 subtype assay of series 4. i: numbers in
parentheses indicate IC50 in AT2 subtype assay of series 3.

      Assay                                                  IC50 or Ki (nM)    Ref.

1a    Binding to Endothelin A Receptor                       400                a
1b    Binding to Endothelin A Receptor                       170                a
2a    Inhibition of DNA fragmentation by Topoisomerase I     28                 b
2b    Inhibition of DNA fragmentation by Topoisomerase I     143                b
3a    Binding to AT2 subtype of Angiotensin II Receptor      17 (0.45)h         c
3b    Binding to AT2 subtype of Angiotensin II Receptor      173 (31)h          c
4a    Binding to AT1 subtype of Angiotensin II Receptor      0.85               d
4b    Binding to AT1 subtype of Angiotensin II Receptor      1.4                d
4c    Binding to AT1 subtype of Angiotensin II Receptor      1.2 (23,000)i      d
5a    Inhibition of Prolylendopeptidase Protease Activity    5                  e
5b    Inhibition of Prolylendopeptidase Protease Activity    10.3               e
6a    Binding to Leukotriene B4 Receptor                     320                f
6b    Binding to Leukotriene B4 Receptor                     3.2                f
7a    Binding to A2A Type Adenosine Receptor                 6.3                g
7b    Binding to A2A Type Adenosine Receptor                 41.3               g
8a    none
8b    none
8c    none
8d    none
8e    none

The bulk of these molecules have previously been used as part of an in-depth
study validating molecular descriptor approaches for the prediction of
molecular diversity within compound classes. This is a more stringent
discrimination than the base criterion sought for the QSCD model, which seeks at
a minimum to show accurate diversity prediction between compound classes.

Conformations of all 20 test molecules were "quantized" and then
mapped onto the basis set of 1.1 × 10^14 theoretical surfaces. Complementary
surfaces are tabulated for each molecule in Table 4.

Table 4: Tabulation of surface shapes and total number of theoretical target
surfaces complementary to each molecule in Fig. 5.

            Complementary     Complementary surfaces
            surface shapes    (shape plus functionality)

1a          376               16,127,687
1b          379               9,086,768
2a          27                545,584
2b          27                416,210
3a          414               4,970,816
3b          315               813,024
4a          487               4,542,463
4b          479               12,388,826
4c          482               7,595,982
5a          337               2,080,523
5b          374               1,966,837
6a          220               192,067
6b          186               153,436
7a
7b          269               22,367
8a          45                5,561,654
8b          41                3,959,678
8c          333               17,324,247
8d          64                2,059,546
8e          87                1,343,811
average:                      4,572,523

There are many ways to analyze the resulting set of complementarity mappings.
Because in this case individual molecule comparisons were desired, each of the
20 mappings was compared pairwise for a total of 190 data points. Mappings
were scored in similarity from 0 to 1000 based on a function of the number of
theoretical surfaces in common:



\[ \text{Score} = SS \times FS = \text{ShapeScore} \times \text{FunctionalityScore} \]

\[ SS = 100 \times \frac{\#\ \text{theoretical target surface shapes common to A and B}}{\text{total}\ \#\ \text{surface shapes complementary either to A or to B}} \]

\[ FS = 10 \times \left( \frac{\#\ \text{theoretical target surface shapes common to A and B with at least 1 set of 4 common functionalities}}{\text{total}\ \#\ \text{theoretical target surface shapes common to A and B}} \right)^{\varphi} \]



The first term in this equation gives a percentage measure (0-100) of
shape similarity between molecules A and B, while the second term gives a
measure from 0-10 of functional similarity per given shape overlap. The
complete scores are detailed in Table 5 contained in Figure 22. Using this
scoring system, the maximum score obtainable by very rigid, structurally similar
molecules is 1000. However, many molecules can only be sampled by an
examination of up to 100 low energy conformations (an average molecule with 5
or more rotatable bonds will have at least 3^5 = 243 conformations). Thus, for most
molecules with more than 100 accessible conformations, similarity scores
between 0-100 are observed. The scoring constant φ in the equation above
adjusts the influence of functionality on scoring. A value of 0.33 was found to be
optimal (as discussed below), meaning that shape is the dominant criterion in our
measure of diversity. A pseudocode description of an algorithm for determining
either the similarity of two molecules or the similarity of two libraries of
molecular structures is set forth in Appendix D.
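
The pairwise score can be sketched in Python. This is not the algorithm of Appendix D; it assumes that each complementarity map is available as a set of (shape, functionality) pairs, that the scoring constant enters the functionality term as an exponent, and that two molecules with no shape in common score zero. The function and parameter names are hypothetical.

```python
def similarity_score(map_a, map_b, phi=0.33):
    """Score two complementarity maps from 0 to 1000.

    Each map is a set of (shape_id, functionality) tuples, one per
    theoretical target surface the molecule is complementary to.
    """
    shapes_a = {shape for shape, _ in map_a}
    shapes_b = {shape for shape, _ in map_b}
    common_shapes = shapes_a & shapes_b
    if not common_shapes:
        return 0.0
    # Shape score: shared shapes as a percentage of all shapes hit by A or B.
    shape_score = 100.0 * len(common_shapes) / len(shapes_a | shapes_b)
    # Shapes on which A and B also share at least one identical surface.
    shapes_with_shared_function = {shape for shape, _ in map_a & map_b}
    ratio = len(shapes_with_shared_function) / len(common_shapes)
    functionality_score = 10.0 * ratio ** phi
    return shape_score * functionality_score
```

Two identical maps score 1000. With an exponent of 0.33 the functionality term stays close to 10 unless very few common shapes share functionality, so the shape term dominates the result.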

Fig. 6 shows a plot of all 190 pairings ranked by similarity score. Circles
show "heterogeneous" pairs of expected dissimilarity (e.g. 2a, 6b), while squares
show "homogeneous" pairs of expected similarity (e.g. 2a, 2b). Clearly, the
QSCD model ranks homogeneous pairs almost exclusively higher than
heterogeneous pairs; all 15 pharmacologically similar pairs fell within the top 20
scores out of 190. All homogeneous scores were ranked above 25, while the
median score in this experiment was 2.8, showing good "signal to noise." The
QSCD model is thus a valid predictor of target binding similarity among these
molecules.
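
The ranking check described above (homogeneous pairs in the top 20, lowest homogeneous score, median score) can be expressed as a short Python sketch. The function name and the input layout are hypothetical, and the sample values in the usage lines are invented for illustration and are not taken from Table 5.

```python
from statistics import median


def ranking_summary(scores, homogeneous, top_n=20):
    """Summarize how well expected-similar pairs separate from the rest.

    scores: {(mol_a, mol_b): similarity score} for every pair.
    homogeneous: set of pairs expected to be similar.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    top = set(ranked[:top_n])
    return {
        "homogeneous_in_top": len(top & homogeneous),
        "lowest_homogeneous": min(scores[pair] for pair in homogeneous),
        "median_score": median(scores.values()),
    }


# Invented toy data, only to show the call.
toy_scores = {("2a", "2b"): 90.0, ("6a", "6b"): 40.0, ("2b", "6a"): 3.0,
              ("2a", "6a"): 2.0, ("2a", "6b"): 1.0}
toy_summary = ranking_summary(toy_scores, {("2a", "2b"), ("6a", "6b")}, top_n=2)
```

A large gap between the lowest homogeneous score and the median score corresponds to the "signal to noise" described in the text.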

The pairings also reveal further validation. As might be expected from
their relative rigidity (low number of accessible conformations) and structural
similarity, the highest scoring pairs are 2a/2b, 8a/8b, and 8d/8e. Furthermore,
examination of the pairings of 8c with 8a,b,d,e (triangles in Fig. 6, yellow in
Table 5) yields scores that are within the top 20% of the pairing experiment but
which are generally lower than the "homogeneous" pairs. This makes sense from
a target-binding point of view, considering that one face of 8c contains a large
molecular difference (an extra phenyl substituent). To the extent that this face is
not involved in complementarity to a target surface, the molecules are similar; to
the extent that this face must be complementary for binding to occur, the
molecules are quite different. Fig. 7 shows one such case of a surface common to
both 8a and 8c; the protruding phenyl substituent plays no role in
complementarity.

As a rule, there is close similarity of shape and functionality for molecules
which score 25 or higher. In addition to the pairings that one would expect,
several other pairs scored between 25-35. When these molecules were examined
by molecular modeling, significant overlaps were found, suggesting that these
high scores are not just "noise" in the QSCD model. Fig. 8 depicts one such case
between 1a and 5a; conformations of 1a and 5a are displayed that were found in
the QSCD model to be complementary to the same surface (Fig. 8A, 8B). 3D
overlays (Fig. 8C) confirm correlation of general shape and 4 points of
functionality, although they also make clear the limits of resolution of
complementarity information using 4.24 Å units. As can be seen from Fig. 8, the
surface in question can detect general shape and functional similarity, but does
not provide a basis to predict atom-for-atom overlap between molecules.

A final result comes from examination of the QSCD model's rankings of
sets 3 and 4. While set 4 is known to bind exclusively to the AT1 subtype of the
Angiotensin II receptor, set 3 is known to bind to both the AT1 subtype and the
AT2 subtype. While the QSCD model found high similarity within sets 3 (score
= 27) and 4 (avg. score = 54), it found an average similarity of 6.9 between 3a
and set 4 and an average similarity of 3.3 between 3b and set 4 (diamonds in Fig.
6, green in Table 5 contained in Figure 22). Based on the QSCD model, one
would therefore conclude that while sets 3 and 4 share a limited number of
complementary theoretical surfaces, they are dissimilar with respect to the
majority of theoretical target surfaces. This is in fact the case with the AT2
subtype of the Angiotensin II receptor, to which 4c binds 50,000 times more
poorly than 3a (see Table 3).

Advantages of the QSCD model

Having validated the basis set used for the QSCD model in the
classification of molecular diversity, it must be noted that other models may do as
well or better in detecting target binding similarity/dissimilarity between
molecules. For instance, Fig. 9 and Table 6, contained in Figure 23, show the same
set of 20 molecules ranked by Tanimoto similarity of standard 2D UNITY
fingerprints (see discussion below). The data demonstrate that the 2D model is
equally capable of predicting pharmacologically similar pairs; UNITY ranks
similarity between AT1 and AT2 subtype binders much higher than our QSCD
model, although it finds unusually high similarity between 8a and 8c. In general,
such 2D fingerprint descriptors have been found effective in clustering
pharmacologically similar compounds, and are widely used in determining
molecular diversity of existing structures.

Among the advantages of the QSCD model is the value of its negative
information: The QSCD model determines not only diversity of existing
structures, but also the structure of non-existing diversity. Given theoretical
surface shapes for which no complements exist in a general screening library,
QSCD allows the design of molecules to fill the given diversity void.

As stipulated in its formulation, the QSCD basis set is created through a
reversible process. Although some information resolution may be lost in fixing
the parameters of a cube's size and functional scope, information content is
retained in either direction. Just as a single molecular conformation and
orientation corresponds to a defined pattern in QSCD space, likewise, a single
point in QSCD space (within the limits of volume V, resolution R, and N sites of
functionality Pm) corresponds to a unique 3D shape with a defined 3D array of
functionality. Given any starting set of molecules, unoccupied points in QSCD
space directly define the molecular shapes and functionalities which those
molecules do not cover. Thus, a set of detailed 3D molecular templates (at the
resolution of the QSCD model used) is immediately available for the creation of
novel molecules.
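
In set terms, the unoccupied points are simply the basis set minus the union of all complementarity maps. A minimal Python sketch, assuming surfaces are represented by hashable identifiers and using a hypothetical function name:

```python
def coverage(basis_surfaces, complementarity_maps):
    """Return (fraction of the basis set covered, set of unfilled surfaces).

    basis_surfaces: set of all theoretical target surface identifiers.
    complementarity_maps: iterable of sets, one per molecule.
    """
    filled = set().union(*complementarity_maps) & basis_surfaces
    unfilled = basis_surfaces - filled
    return len(filled) / len(basis_surfaces), unfilled


# Toy example: ten surfaces, two molecules covering three of them.
fraction, unfilled = coverage(set(range(1, 11)), [{1, 2}, {2, 3}])
```

The returned fraction is the absolute diversity measure discussed earlier, and each element of `unfilled` identifies a shape and functionality pattern available as a design template.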

As an example, Fig. 10 shows a plot of all of the theoretical surface
shapes covered by all of the conformations of all of the molecules used in the
example implementation (see Fig. 5). The total volume of the cube in Fig. 10
encompasses all 49,268,918 theoretical surface shapes as listed in Table 1. As
can be seen from the plot and two expanded points, many theoretical surface
shapes are "unfilled" by the set of compounds shown in Fig. 5. Thus, in
searching for molecules or libraries to enhance the diversity of the given set of
compounds, the chemist is presented with a set of actual 3D templates into which
new compound libraries may be designed. In comparison, although mapping the
same set of compounds in a "non-reversible" diversity space would also display a
set of coordinates to which the molecules map, there would be no way to
visualize the 3D shape of any point that was not filled by one of the compounds
in the set. Using BCUT values for example (Fig. 11), the coordinates specified
for an unfilled point leave the chemist with a set of normalized eigenvalues.
While these may give an idea of relative abundance of a given functionality (e.g.
H-Bond Donor) at this point in diversity space, the coordinates give no hint of
what shape or class of molecules might fill that diversity void.

The above example shows how QSCD is a reversible diversity model with
respect to molecular shape. Within a given surface shape in QSCD, there are
many combinations of functionality leading to many different theoretical
surfaces. If a given library fills only a portion of theoretical surfaces of a given
surface shape, by following the same process outlined above and in Fig. 10,
unfilled surfaces of specific shape and functionality may be identified and filled
with complementary libraries. By using data-mining algorithms to analyze and
intersect the shape and functionality of unfilled surfaces, a minimal set of
"missing" 3D combinatorial templates can be deduced from the QSCD mapping
of a given set of general screening compounds. These templates represent the
smallest number of combinatorial syntheses which need to be executed in order
to fill out the diversity of the set of screening compounds. One such template is
depicted in Fig. 12. In conjunction with the efficiency of core-based
combinatorial chemistry, QSCD makes possible the contemplation of a
"complete" library of screening molecules at a given resolution. The model thus
offers a theoretical and practical answer to the problem of generating lead
structures for genomic targets of unknown structure and function.
Technical details of an implementation 

For the example discussed above, molecular conformations were
generated with Multisearch in Sybyl (version 6.5, Tripos Inc, 1699 S. Hanley
Rd., St. Louis, MO, 63144) on an R10000 Silicon Graphics workstation.
Conformations were subsequently sorted by energy and conformations within 10
kcal of the lowest energy were accepted. Overlay plots of molecules (Fig. 8B)
were also generated using Sybyl.

UNITY 2D fingerprints (Unity 4.0, Tripos Inc, 1699 S. Hanley Rd., St. 
Louis, MO, 63144) were generated on an R10000 Silicon Graphics workstation. 
5 Pairwise Tanimoto coefficients were computed as described by Dixon and 
Koehler. 
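
For reference, the Tanimoto coefficient of two bit-string fingerprints is the number of bits set in both divided by the number of bits set in either. The Python sketch below uses that standard definition on sets of on-bit indices; it is not the UNITY interface and may differ in detail from the procedure of Dixon and Koehler. The function name and the convention for two empty fingerprints are assumptions.

```python
def tanimoto(bits_a, bits_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit indices."""
    union = len(bits_a | bits_b)
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return len(bits_a & bits_b) / union
```

For example, fingerprints with on-bits {1, 2, 3} and {2, 3, 4} share two of four distinct bits and so have a coefficient of 0.5.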

QSCD software for molecule quantization, mapping of Q-files, and
surface complementarity display was developed using the Java programming
language (JDK 1.2) and the Java3D graphics API (version 1.1) on Intel-based
workstations. Theoretical target surfaces were stored and indexed using an
Oracle 7.3.3 database.

Parameters for theoretical target surface generation/molecular 
quantization and parameters for complementarity mapping/scoring were 
alternately optimized in three successive rounds as described below. 

The parameters used for theoretical target surface generation and the
closely related parameters for quantization of small molecules into quantized files
(Q-files) were optimized in the context of the algorithms mentioned above.
Parameters were iteratively optimized by varying a given parameter and then
quantizing training molecules other than those in Fig. 5. Training molecules
used were taken from in-house structures and two published SAR sets.
Concomitant with molecular quantization, an enumerated set of theoretical target
surfaces was created with corresponding parameters. Using the current optimized
complementarity/scoring parameters, molecules were then mapped to theoretical
target surfaces and all diversity pairing scores generated as described in the text.
Parameters were chosen which accurately predicted known
homogeneous/heterogeneous pairs and which maximized "signal to noise" of
homogeneous scores over heterogeneous scores.

The parameters used for mapping/scoring molecular conformations to
theoretical target surfaces were optimized in the context of the algorithm stated
above. Parameters were iteratively optimized by varying a given parameter and
then mapping a constant set of training molecules (see above) to a constant set of
theoretical target surfaces, using the most current surface generation and
quantization parameters. Diversity pairing scores were generated for all training
molecules, and parameters were chosen which accurately predicted known
homogeneous/heterogeneous pairs and which maximized "signal to noise" of
homogeneous scores over heterogeneous scores.

As mentioned earlier, for a given conformation-to-surface shape fit to be
accepted, the minimum overlap requirement was set to either 9 quanta or N-2
quanta of a conformation of N quanta. This range allows large conformations to
fit partially into a theoretical surface (protruding volume must be at the mouth of
the surface) while also allowing smaller conformations to be considered for
complementarity. It excludes large conformations which do not overlap at least 9
quanta.

Approximate computational speeds of typical QSCD operations are as
follows on a single Pentium III 500 MHz workstation: Generation of the basis
set of theoretical target surfaces used in the study required 17 min.; this data was
stored for access by subsequent QSCD functions. Quantization of 100
conformations of a given molecule into 100 Q-files required 250 seconds.
Complementarity mapping of 100 Q-files onto the basis set of theoretical target
surfaces used in the study required 40 seconds.
20 Algorithm for Designing Molecules for Unfilled Target Surfaces 

The following algorithm could be used to design novel molecules based on complementarity to unfilled theoretical target surfaces, that is, surfaces that are not complementary to any existing molecular conformations.

1) Existing molecules are quantized and those negative space cube target surfaces to which their conformations are complementary are identified (see Appendixes A, B, and C).

2) For a given set of existing molecules and a desired set of theoretical target surfaces with given shapes and functionalities, those theoretical target surfaces are identified to which no existing molecular conformations are complementary. Novel molecules are designed to be complementary to the above identified theoretical target surfaces as follows:

a) Let T be the set of those theoretical target surfaces to which no existing molecular conformations are complementary.

b) Cluster T into subsets T1-Tn such that for each Ti there is a central set of negative space cubes of given functionality Ci such that all target surfaces in Ti can be created by adding up to E additional negative space cubes of given functionality to Ci at up to each of F points of attachment on Ci, where each point of attachment Aj (j = 1-F) is a single face of Ci to which zero, one or a set of up to E negative space cubes may be added. (See Fig. 24)

c) For each Ci, design a core molecule Mi that fills but does not extend beyond the space defined by Ci within a tolerance limit TOL, the core molecule furthermore complementary to the functionality of Ci, and the core molecule furthermore containing at least one combinatorial site (defined as a reactive site that can be further functionalized and/or extended in a combinatorial step under given chemical conditions) that can project potential combinatorial building blocks through at least one plane Aj. (See Fig. 25)

d) Computationally enumerate the combinatorial library L(Mi,B) defined by Mi and a set of building blocks B that are each no larger in volume than E negative space cubes (See Fig. 26).

e) Quantize nc conformations of each molecule in L(Mi,B) and determine the set W of all theoretical surfaces to which any conformation of any molecule in L(Mi,B) is complementary (see Appendixes B and C).

f) Compute the set of target surfaces M = W ∩ Tn.

g) If M contains an acceptable number of novel surfaces as defined by the user, then chemically synthesize the actual library L(Mi,B). Otherwise, choose a new Mi in step c for the given Ci and repeat until the conditions in this step are met.

h) Move on to the next Ci in step c.
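Steps e through g amount to a set intersection followed by a threshold test. The sketch below illustrates that decision under stated assumptions: surfaces are represented by hashable identifiers, the fits of each enumerated library member are already known, and the names `novel_surfaces`, `should_synthesize` and `min_novel` are hypothetical rather than taken from the text.

```python
def novel_surfaces(library_fits, unfilled_cluster):
    """Return M, the unfilled target surfaces covered by the library.

    library_fits: dict mapping a molecule id to the set of surface ids
                  to which any of its conformations is complementary.
    unfilled_cluster: iterable of surface ids in the cluster of T.
    """
    # W is the union of all surfaces fitted by any library member (step e).
    w = set().union(*library_fits.values())
    # M is the intersection of W with the cluster (step f).
    return w & set(unfilled_cluster)


def should_synthesize(library_fits, unfilled_cluster, min_novel):
    """Step g: accept the library if M holds at least min_novel surfaces."""
    return len(novel_surfaces(library_fits, unfilled_cluster)) >= min_novel
```

If `should_synthesize` returns False, the outline above calls for choosing a different core molecule Mi for the same Ci and repeating the enumeration.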
Protein Diversity

An extension of any diversity model based on an absolute frame of reference is that the same basis set may be used to classify actual proteins. By mapping onto the QSCD basis set all surfaces of volume V of a known protein, actual proteins can be compared and classified by their 3D binding sites. In addition to providing a diversity map of known protein binding sites within the universe of all theoretical protein surfaces under given parameters, the theoretical surfaces of QSCD may thus be used to correlate protein classes to complementary molecular core structures.

Appendix E describes an algorithm for quantization of protein surfaces. Appendix F describes an algorithm for comparing protein surfaces to determine a degree of similarity or dissimilarity. The following algorithm generates a set of files T of quantized protein binding surfaces which together represent the available surface of a given protein binding site.

Algorithm T(protein binding site, H, R, V, TolA, TolB, Transinc, Rot, Rotvar, PM)

Protein binding site: see 1. below
H: resolution of scanning grid in angstroms, H <= R
R: resolution of cube = length of cube side in angstroms
V: total volume of each surface in cubic angstroms
TolA: tolerance (%) of atomic radii
TolB: tolerance (%) of cube volume which must intersect convex hull
Transinc: translational increment in angstroms
Rot: # rotations (odd integer)
Rotvar: rotational variance (%)
PM: for each of the V/R^3 cubes used, any of M types of chemical functionality

1. Take a protein binding site file (obtained by any number of commercially available methods, such as running a "Connolly surface search" on a standard "PDB file" (Michael L. Connolly, 1259 El Camino Real, #184, Menlo Park, CA 94025)) which minimally consists of:

a) A calculated probable electron density surface of the binding site

b) A list of all known atom types in the molecule with their coordinates and atomic radii

c) A list of known connectivities of all atoms with the type of bond connecting each atom

2. Overlay flexible 2D square grid(s) of grid size HxH angstroms over all calculated probable electron density surfaces in the protein binding site.

3. Calculate the convex hull of the set of points defined by the points of each grid nexus.

4. Examine each grid nexus one at a time.

5. At a given grid nexus place a cube with the center of one face tangent to the probable electron density surface of the protein binding site.

6. If the cube contains a protein atom coordinate or any atomic radii of protein atoms protrude into the trial cube by more than TolA % of their atomic radius, then step 7., otherwise step 8.

7. Translate the cube Transinc angstroms away from the grid nexus on the probable electron density surface while keeping the center of the cube on a line perpendicular to the probable electron density surface at the grid nexus. If the cube is now R angstroms away from the grid nexus, take the next nexus in 4., otherwise return to step 6.

8. Designate the cube as being a set cube with 5 unchecked faces and 1 checked face (the checked face being that which faces the grid nexus being examined). Designate the initial frame of reference to be the X, Y, and Z axes co-linear with the sides of the cube.

9. Rotate the unchecked cube about its center in all combinations (total of Rot^3 combinations) of the following units:

a) X rotations: any one of Rot units (degrees) from -Rotvar * 90/2 to +Rotvar * 90/2 by Rotvar * 90/Rot

b) Y rotations: any one of Rot units (degrees) from -Rotvar * 90/2 to +Rotvar * 90/2 by Rotvar * 90/Rot

c) Z rotations: any one of Rot units (degrees) from -Rotvar * 90/2 to +Rotvar * 90/2 by Rotvar * 90/Rot

10. For a given rotated system from 9., place face contiguous trial cubes of side length R at all unchecked cube face(s), not to exceed the nearest integer to V/R^3 total cubes. If addition of a new trial cube would exceed the nearest integer to V/R^3 total cubes, proceed directly to 11 without adding further trial cubes.

11. All unchecked cube faces (not including cube faces on trial cubes) become checked cube faces.

12. If a trial cube contains a protein atom coordinate, or any atomic radii of protein atoms protrude into the trial cube by more than TolA % of their atomic radius, or if the cube does not intersect with a volume of the convex hull equal to at least R^3 * TolB (see 3. above), then the trial cube is removed. Otherwise the trial cube becomes a set cube (with unchecked faces).

13. If the total number of set cubes becomes equal to the nearest integer to V/R^3 (yielding a resulting cube combination of the nearest integer to V/R^3 cubes), then go to step 15.

14. If no trial cubes in step 12 became set cubes and there are no unchecked cube faces remaining then start with the next rotated cube in step 9. If no rotations produce resulting cube combinations in step 13. then start with the next translation in step 1.

15. Ignore any remaining rotations from step 9. Designate all cubes as negative space cubes fully enclosed except as detailed below. Designate the layer of cubes which is

a) perpendicular to the line perpendicular to the probable electron density surface at the grid nexus being examined
b) farthest from the grid nexus

as negative space cubes which are open at their faces farthest from the grid nexus and perpendicular to the line perpendicular to the probable electron density surface at the grid nexus.

16. Based on the types of proximal atoms in the protein surrounding each negative space cube, and based upon the bonds which these proximal atoms form and the other atoms to which these proximal atoms are bonded, ascribe to each negative space cube one and only one of M types of chemical functionality.

Types of functionality M may include but are not limited to:

acidic regions,
basic regions,
regions of formal charge +1,
regions of formal charge -1,
regions of partial charge between +0.5 and +1,
regions of partial charge between -0.5 and -1,
regions of partial charge between 0 and +0.5,
regions of partial charge between 0 and -0.5,
hydrophobic regions,
polarizable regions,
hydrogen bond donating regions,
hydrogen bond accepting regions,
hydrogen bond donating/accepting regions.

17. Yield a "quantized" protein binding surface in terms of the nearest integer to V/R^3 negative space cubes of side length R with any one of M types of functionality per negative space cube.



18. Return to step 4. for each grid nexus.

19. Compare all quantized surfaces from 17. and remove any which are identical.

20. Yield a set of files T of quantized protein binding surfaces which together represent the available surfaces of the given protein binding site.
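The rotation sweep of step 9 can be sketched as follows. The text does not say how Rot values are placed inside the interval from -Rotvar * 90/2 to +Rotvar * 90/2, so this sketch assumes Rot angles spaced by Rotvar * 90/Rot and centred on zero (which an odd Rot makes possible), and it assumes Rotvar is supplied as a percentage. The function names are hypothetical.

```python
from itertools import product


def rotation_angles(rot: int, rotvar_pct: float) -> list[float]:
    """Rot angles (degrees) spaced by Rotvar*90/Rot and centred on zero."""
    if rot % 2 == 0:
        raise ValueError("Rot must be an odd integer")
    step = (rotvar_pct / 100.0) * 90.0 / rot  # Rotvar given in percent (assumption)
    half = (rot - 1) // 2
    return [(k - half) * step for k in range(rot)]


def rotation_combinations(rot: int, rotvar_pct: float) -> list[tuple]:
    """All Rot^3 combinations of X, Y and Z rotations."""
    angles = rotation_angles(rot, rotvar_pct)
    return list(product(angles, angles, angles))
```

With Rot = 3 and Rotvar = 100 % this gives the angles -30, 0 and +30 degrees on each axis and 27 combinations in total, including the unrotated cube.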

Comparing Molecules to Proteins

Appendix G describes an algorithm for determining the complementarity 
of a library of molecules to a set of protein surfaces. 



Algorithm for Designing Molecules for a Set of Protein Target Surfaces 

In another procedure, novel molecules (for testing as ligands for proteins) can be designed based on complementarity to negative space cube targets to which a set of protein pockets map. The following outline describes the steps for doing so:

1) A set of existing protein pockets is quantized and those negative space cube target surfaces to which their quantizations map are identified. (See Appendixes B and C.)

2) Novel molecules are designed to be complementary to the above identified set of target surfaces as follows:

a) Let the set of target surfaces = T.

b) Cluster T into subsets T1-Tn such that for each Ti there is a central set of negative space cubes of given functionality Ci such that all target surfaces in Ti can be created by adding up to E additional negative space cubes of given functionality to Ci at up to each of F points of attachment on Ci, where each point of attachment Aj (j = 1-F) is a single face of Ci to which zero, one or a set of up to E negative space cubes may be added. (See Fig. 27)

c) For each Ci, design a core molecule Mi that fills but does not extend beyond the space defined by Ci within a tolerance limit TOL, the core molecule furthermore complementary to the functionality of Ci, and the core molecule furthermore containing at least one combinatorial site (defined as a reactive site that can be further functionalized and/or extended in a combinatorial step under given chemical conditions) that can project potential combinatorial building blocks through at least one plane Aj. (See Fig. 28)

d) Computationally enumerate the combinatorial library L(Mi,B) defined by Mi and a set of building blocks B that are each no larger in volume than E negative space cubes (See Fig. 29).

e) Quantize nc conformations of each molecule in L(Mi,B) and determine the set W of all theoretical surfaces to which any conformation of any molecule in L(Mi,B) is complementary. (See Appendixes B and C.)

f) Compute the set of target surfaces M = W ∩ Tn.

g) If M contains an acceptable number of novel surfaces as defined by the user, then chemically synthesize the actual library L(Mi,B). Otherwise, choose a new Mi in step c for the given Ci and repeat until the conditions in step g are met.

h) Move on to the next Ci in step c.



Parameter Values 

Appendix H contains example parameter values useful in connection with the algorithms described in Appendixes A through G.

Further Extensions 

As mentioned, a 4.24 Å cube was found to be the largest predictive unit size of diversity measure for our criteria of designing general screening libraries. For example, both 4.48 and 4.00 Å units gave poorer prediction of homogeneous/heterogeneous pairs than the pairings of Fig. 6 (4.24 Å units). This is likely due to the fact that most organic small molecules are themselves quantized by a limited basis set: the VDW radii of H, C, N, O and a few other atoms (see for example Fig. 1). If there is no constraint on size of cubic units, however (i.e., if there is no attempt to maximize orthogonality of theoretical target surfaces), other unit measures of diversity can be found. A unit of 2.12 Å should also provide effective diversity information but at a much higher resolution. Such a "high-resolution" adaptation of QSCD brings with it numerical (and thus computational) challenges. 112 negative space cubes (14 x 8) are now required at the upper limit of theoretical target surface size, translating to exponentially greater numbers of theoretical target surfaces and, depending on the stringency of fitting parameters, correspondingly greater numbers of surface fits per molecule as in Table 4. At this resolution, the assumption of no occlusions in theoretical target surfaces becomes less valid, and removal of this assumption increases computational complexity further.
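The cube counts quoted above follow from dividing the maximum surface volume by the cube volume and rounding to the nearest integer. A short check, using the 1070 cubic Å upper limit stated elsewhere in this description (the helper name `n_cubes` is hypothetical):

```python
def n_cubes(volume_cubic_angstrom: float, side_angstrom: float) -> int:
    """Nearest integer to V / R^3 for cubes of side R."""
    return round(volume_cubic_angstrom / side_angstrom ** 3)


coarse = n_cubes(1070, 4.24)  # 4.24 Å resolution
fine = n_cubes(1070, 2.12)    # 2.12 Å resolution
```

Halving the cube side divides the cube volume by 8, which is why the count rises from 14 to 14 x 8 = 112.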
A corollary of any absolute diversity model is a prediction of the total "size" of diversity space in terms of unique molecular points. In other words, what is the minimum set of molecules needed to fully cover a given diversity space? This calculation is dependent on two factors: the resolution stipulated in the model (e.g., what amount of molecular change is recognized as different) and the maximum values of each dimension of the model's basis axes. In the model of QSCD discussed above, resolution is fixed by cubic units of 4.24 Å, and maximum values are fixed at 14 units (molecular volume of 1070 cubic Å) and 4 points of 7 types of molecular property characteristics. As described above, the result is a set of 1.1 x 10^14 unique molecular points. Since, using the parameters of this study, an average molecule covers 4.6 million of the unique molecular points bounded by QSCD space (Table 4), the model predicts that a minimum of 1.1 x 10^14 / 4.6 x 10^6 = 24 million molecules would be necessary to completely cover diversity space.
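The estimate is a single division. The sketch below reproduces it from the two figures quoted in the paragraph; it assumes, as the text does, that every molecule covers the average number of points and that no two molecules cover the same point, so the result is a lower bound. The function name is hypothetical.

```python
def minimum_library_size(total_points: float, points_per_molecule: float) -> float:
    """Lower bound on library size if coverage never overlaps."""
    return total_points / points_per_molecule


size = minimum_library_size(1.1e14, 4.6e6)
size_in_millions = round(size / 1e6)
```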

We estimate that an average complementary molecule in the context of the QSCD model has a ΔG of complementarity on the order of -11 kcal/mol (Table 7).

Table 7. Summation of binding energies for an interaction of an average complementary molecule/theoretical target surface pair in the context of the QSCD model used herein. An average molecule is assumed to have a buried volume of 12 cubic quanta (= 915 cubic Å at 4.24 Å resolution), 36 exposed faces (4.24 Å square), 21 non-polar exposed faces (60%), 10 rotatable bonds, 4 points of complementary electrostatic/VDW potential, and a conformational energy within 2 kcal/mol of ground state. An average complementary theoretical target surface is also assumed to have 60% non-polar exposed faces. Constants used in the table are taken from Ajay and Murcko.

Energetic contribution                                        Average ΔG

translational/vibrational entropic loss (constant)            +9 kcal/mol

constant +0.7 kcal/mol (RT ln 3) per rotatable bond;          +7 kcal/mol
assume rigid theoretical target surface

ΔΔG conformation from ground state                            +2 kcal/mol

constant -0.03 kcal/mol per square Å non-polar buried         -23 kcal/mol
surface = -0.54 kcal/mol per 4.24 Å square non-polar
buried face; total 21 molecular faces + 21 theoretical
surface faces

total interaction from Table 2                                -6.0 kcal/mol
(four complementary points)

sum of binding energies                                       -11 kcal/mol

binding affinity, exponent to nearest integer (ΔG/1.363)      10^-8 = 10 nanomolar
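The arithmetic of Table 7 can be checked with a few lines. The sketch sums the rounded table entries and converts the total to a power of ten using the factor 1.363 kcal/mol per log unit that appears in the last row; the dictionary keys and function names are hypothetical labels for the table rows.

```python
KCAL_PER_LOG_UNIT = 1.363  # divisor used in the last row of Table 7

contributions = {
    "translational/vibrational entropy": +9.0,
    "rotatable bonds (10 x 0.7)": +7.0,
    "conformational strain": +2.0,
    "non-polar buried surface (42 faces)": -23.0,
    "four complementary points": -6.0,
}


def total_dg(terms: dict) -> float:
    """Sum of binding energy contributions in kcal/mol."""
    return sum(terms.values())


def affinity_exponent(dg: float) -> int:
    """Power of ten of the binding affinity, to the nearest integer."""
    return round(dg / KCAL_PER_LOG_UNIT)


# Unrounded non-polar term: 42 faces of 4.24 Å x 4.24 Å at -0.03 kcal/mol per square Å.
nonpolar_unrounded = -0.03 * 4.24 ** 2 * 42
```

The unrounded non-polar term is about -22.65 kcal/mol, which the table rounds to -23 kcal/mol.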



In other words, the resolution used to calculate diversity translates roughly to nanomolar binding conditions for an average molecule/target surface pair. Given that some 24 million molecules are needed to completely cover diversity space under these conditions, a general screening library guaranteed to contain at least one nanomolar binder to any given target of interest would thus number at least 24 million molecules. This is a large number and will be attenuated by the fact that some molecules have significantly more than 100 conformations available to them. However, the QSCD model suggests that if, in the near future, combinatorial chemistry and high-throughput screening are to generate initial hits primarily in the nanomolar rather than micromolar range, then the field must continue to focus its efforts on the development of numerically competent synthesis and screening technologies.

By defining molecules in terms of their complementarity to a fully enumerated set of theoretical protein surfaces under given parameters, and by defining actual protein surfaces in terms of their similarity to the same set of theoretical protein surfaces, the model allows:

1. Numerical prediction of protein surface diversity (how many different types of possible protein surfaces exist under a given set of parameters).

2. Numerical prediction of how many and what type of molecules would be necessary to create a "universal molecular library" (a library which contains at least one complement [of a given score] to any given protein surface).

3. Comparison of the similarity or difference of molecules or sets of molecules based on their complementarity to theoretical protein surfaces. The frame of reference for such comparisons is fixed no matter how many or what types of molecules are involved.

4. Numerical prediction of the percent of protein surface diversity to which a given set of molecules is complementary.

5. Comparison of the similarity or difference of actual protein surfaces or sets of surfaces based on their similarity to theoretical protein surfaces. The frame of reference for such comparisons is fixed no matter how many or what types of protein surfaces are involved.

6. Numerical prediction of the percent of protein surface diversity to which a given set of actual protein surfaces is similar.

7. Prediction of the actual protein surfaces to which a given molecule is complementary.

8. Prediction of how many and what type of molecules would be necessary to create a "universal molecular library" against a given set of actual protein surfaces (a library which contains at least one complement [of a given score] to each actual protein surface).



This application discloses information from an article which has been accepted for publication in the peer-reviewed Journal of Medicinal Chemistry and is currently scheduled for publication in the May 18th issue. The article, listed by the Journal of Medicinal Chemistry as JM990504B, is titled "Quantized surface complementarity diversity (QSCD): A model based on small molecule-target complementarity," and is incorporated herein by reference.

Other implementations are within the scope of the following claims.

Theoretical Surfaces

A surface opening $O$ is a set of lattice squares in $\mathbb{Z}^2$ denoted by their corners:

$$O = \{(x_i, y_i) \in \mathbb{Z}^2,\ i = 1 \ldots m\}$$

By definition, a surface opening must be connected. For connectivity purposes, two points $p, q \in \mathbb{Z}^2$ are neighbors iff

$$|p_x - q_x| + |p_y - q_y| = 1$$

The area of a surface opening is defined to be the number of lattice squares it contains. Surface openings are considered to be unique up to translations and rotations of the x-y plane.

A surface shape $S$ is a set of "negative space" cubes represented as lattice cubes in $\mathbb{Z}^3$ denoted by their corners:

$$S = \{(x_i, y_i, z_i) \in \mathbb{Z}^3,\ i = 1 \ldots n\}$$

By definition, a surface shape must satisfy the following conditions:

• All cubes are below the x-y plane. That is, if $(x, y, z) \in S$, then $z \le 0$.

• The surface shape is connected. For connectivity purposes, two cubes $p, q \in \mathbb{Z}^3$ are neighbors iff

$$|p_x - q_x| + |p_y - q_y| + |p_z - q_z| = 1$$

• There are no occlusions along the z-axis. That is, if $(x, y, z) \in S$ and $z < 0$, then $(x, y, z + 1) \in S$.

The volume of a surface shape is defined to be the number of lattice cubes it contains. Surface shapes are considered to be unique up to translations and rotations of the x-y plane.

Every surface shape specifies a unique opening:

$$\mathrm{opening}(S) = \{(x, y) \mid (x, y, z) \in S\}$$
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APPENDIX A. THEORETICAL SURFACES 

Further, given a surface opening $O$ and a function $d : O \to \mathbb{N}$, a unique surface shape with the given opening is defined:

$$\mathrm{shape}(O, d) = \{(x, y, z) \mid (x, y) \in O,\ z \le 0,\ d(x, y) > -z\}$$

The function $d$ specifies the "depth" of the surface shape at each opening point. Sample surface shapes and openings can be seen in Figure 13.
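The two maps can be written directly as set comprehensions. This is a minimal sketch that assumes the top layer of cubes sits at z = 0 and deeper layers at z = -1, -2, ..., so that a depth of d yields exactly d cubes in a column; the function names mirror the text but the representation (sets of integer tuples and a dict for d) is a choice made here.

```python
def shape(opening, depth):
    """shape(O, d): the cubes (x, y, z) with (x, y) in O and d(x, y) > -z, z <= 0."""
    return {
        (x, y, z)
        for (x, y) in opening
        for z in range(0, -depth[(x, y)], -1)  # z = 0, -1, ..., -(d - 1)
    }


def opening_of(surface_shape):
    """opening(S): the projection of the shape onto the x-y plane."""
    return {(x, y) for (x, y, _z) in surface_shape}


def volume(surface_shape):
    """Number of lattice cubes in the shape."""
    return len(surface_shape)
```

With this convention, `opening_of(shape(O, d))` returns `O` for any depth function with d >= 1, and the volume equals the sum of the depths.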

A theoretical surface consists of a surface shape where the cubes in the surface shape are each associated with functionality. The set $\mathcal{F}$ of seven specific types of characteristic functionality, together with a default type $\mathcal{F}_8$, is used:

• $\mathcal{F}_1$: Potential Negative Charge

• $\mathcal{F}_2$: Potential Positive Charge

• $\mathcal{F}_3$: Hydrogen Bond Donor/Acceptor

• $\mathcal{F}_4$: Hydrogen Bond Donor

• $\mathcal{F}_5$: Hydrogen Bond Acceptor

• $\mathcal{F}_6$: Polarizable

• $\mathcal{F}_7$: Hydrophobic

• $\mathcal{F}_8$: Slightly Hydrophobic

A functionality map $f : S \to \mathcal{F}$ defines the assignment. By default, all cubes are assigned functionality $\mathcal{F}_8$, and all possibilities are considered where up to $n_f$ of the cubes are given one of the functionalities $\mathcal{F}_1$-$\mathcal{F}_7$.

Generating all surface shapes is accomplished by the following steps:

1. Generate all surface openings (SURFACEOPENINGS).

2. Filter the surface openings to remove openings that are unlikely to resemble surfaces found in nature (OPENINGFILTER).

3. From the set of filtered openings, generate all surface shapes whose associated openings are in the set (SURFACESHAPES).

4. Filter the set of surface shapes to remove shapes unlikely to resemble surfaces found in nature (SHAPEFILTER).

5. From the set of filtered surface shapes, add all possible combinations of functionality to define a set of theoretical surfaces (FUNCTIONALIZESURFACE).






WO 00/60507 






PCT/US00/08777 






Algorithm A.1 SURFACEOPENINGS(A): calculate all surface openings with area less than or equal to A.

define M_1 to be the set containing only the surface opening with a single square at (0, 0)
for i ← 2 to A do
    M_i ← ∅ {M_i will ultimately contain all openings of area i}
    for all O ∈ M_{i-1} do
        define P to be the set of all possible openings obtained by adding a single square to O adjacent to a square already present in O
        for all P' ∈ P do
            if no rotation or translation of P' is present in M_i then
                add P' to M_i
            end if
        end for
    end for
end for
𝒪 ← M_1 ∪ ... ∪ M_A
return (𝒪)
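Algorithm A.1 can be sketched in a few lines by storing each opening in a canonical form, so that the test "no rotation or translation is present" becomes a set membership test. The canonical form used here (translate to the origin, try the four rotations, keep the smallest sorted tuple) is an implementation choice and is not taken from the text; reflections are deliberately not identified, since the text treats openings as unique only up to translations and rotations.

```python
def normalize(cells):
    """Translate cells so the minimum x and y are 0; return a sorted tuple."""
    min_x = min(x for x, _ in cells)
    min_y = min(y for _, y in cells)
    return tuple(sorted((x - min_x, y - min_y) for x, y in cells))


def canonical(cells):
    """Smallest normalized form among the four rotations of the opening."""
    forms = []
    current = list(cells)
    for _ in range(4):
        current = [(-y, x) for x, y in current]  # rotate by 90 degrees
        forms.append(normalize(current))
    return min(forms)


def surface_openings(max_area):
    """Return a list whose entry i-1 is the set of openings of area i."""
    levels = [{canonical([(0, 0)])}]
    for _ in range(2, max_area + 1):
        grown = set()
        for opening in levels[-1]:
            cells = set(opening)
            for x, y in opening:
                for nb in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if nb not in cells:
                        grown.add(canonical(cells | {nb}))
        levels.append(grown)
    return levels
```

For areas 1 through 5 this yields 1, 1, 2, 7 and 18 openings, the known counts of polyominoes distinguished up to rotation and translation.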
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APPENDIX A. THEORETICAL SURFACES 



Algorithm A.2 OPENINGFILTER(𝒪, A_t, M_nc, M_c): filter a set of surface openings 𝒪 using area-threshold parameter A_t, max-non-central parameter M_nc, and max-contiguous parameter M_c.

𝒪' ← ∅
for all O ∈ 𝒪 do
    Ô ← O
    {delete from Ô all lattice squares with exactly one neighbor that has 2 or more other neighbors}
    for all (x, y) ∈ O do
        N ← O ∩ {(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)}
        if |N| = 1 then
            define (x', y') to be the single element in N
            N' ← Ô ∩ {(x' - 1, y'), (x' + 1, y'), (x', y' - 1), (x', y' + 1)}
            if |N'| > 2 then
                remove (x, y) from Ô
            end if
        end if
    end for
    {delete from Ô all 2x2 blocks of lattice squares in O}
    for all (x, y) ∈ O do
        if (x + 1, y) ∈ O and (x, y + 1) ∈ O and (x + 1, y + 1) ∈ O then
            remove (x, y), (x + 1, y), (x, y + 1), (x + 1, y + 1) from Ô
        end if
    end for
    define c to be the number of squares in the largest connected component left in Ô
    if area(O) < M_nc and (area(O) < A_t or c < M_c) then
        add O to 𝒪'
    end if
end for
return (𝒪')

Algorithm A.3 SURFACESHAPES(𝒪, V): calculate all surface shapes with openings in the set 𝒪 and volume less than or equal to V.



1: for all O ∈ 𝒪 do
2:     define d : O → N such that d(x, y) = 1
3:     S_O ← ∅ {S_O will contain all surfaces with opening O}
4:     v ← area(O)
5:     while v ≤ V do
6:         S ← shape(O, d)
7:         if S_O contains no translations or rotations of S then
8:             add S to S_O
9:         end if
10:        for x ← -area(O) to area(O), y ← -area(O) to area(O) do
11:            if (x, y) ∈ O and v + 1 ≤ V then
12:                v ← v + 1, d(x, y) ← d(x, y) + 1
13:                goto step 5
14:            else
15:                v ← v - d(x, y) + 1, d(x, y) ← 1
16:            end if
17:        end for
18:        goto step 1 {no more surfaces left with this opening}
19:    end while
20: end for
21: 𝒮 ← union of all S_O
22: return (𝒮)
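The counter-style loop of Algorithm A.3 visits every depth function d on an opening whose total depth does not exceed V. The sketch below enumerates the same depth functions recursively instead of with gotos. It is a simplification under stated assumptions: it omits the check that removes shapes equal up to translation or rotation, and the name `depth_functions` is hypothetical.

```python
def depth_functions(opening, max_volume):
    """Yield every dict d with d[cell] >= 1 and sum(d.values()) <= max_volume."""
    cells = sorted(opening)

    def rec(i, remaining):
        if i == len(cells):
            yield {}
            return
        # Every cell after this one still needs a depth of at least 1.
        reserved = len(cells) - i - 1
        for depth in range(1, remaining - reserved + 1):
            for rest in rec(i + 1, remaining - depth):
                yield {cells[i]: depth, **rest}

    yield from rec(0, max_volume)
```

Each yielded dict can be passed to a shape(O, d) construction to obtain the corresponding surface shape; an opening with more squares than V yields nothing.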



Algorithm A.4 SHAPEFILTER(𝒮, M_e): filter a set of surfaces 𝒮 using max-extrusion parameter M_e.

1: 𝒮' ← ∅
2: for all S ∈ 𝒮 do
3:     O ← opening(S)
4:     d ← depth(S) {corresponding depth function}
5:     for all (x, y) ∈ O do
6:         for all (x', y') ∈ {(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)} do
7:             if (x', y') ∈ O and d(x', y') > d(x, y) - M_e then
8:                 goto step 5
9:             end if
10:        end for
11:        goto step 2 {surface does not qualify}
12:    end for
13:    add S to 𝒮' {surface does qualify}
14: end for
15: return (𝒮')

Algorithm A.5 FUNCTIONALIZESURFACE(S, n_f): return the set of theoretical surfaces defined by surface shape S with n_f points of specific functionality attached.
1: 𝒮 ← ∅ {set of functionalized surfaces}
2: define 𝒱 to be the set of all the C(volume(S), n_f) selections of n_f cubes from S
3: for all V ∈ 𝒱 do
4:     define f : S → F such that f(s) = F_8 if s ∉ V, f(s) = F_1 for s ∈ V
5:     number the elements of V as v_1, ..., v_{n_f}
6:     loop
7:         add (S, f) to 𝒮
8:         for i ← 1 to n_f do
9:             if f(v_i) ≠ F_7 then
10:                assign f(v_i) to be the next higher functionality
11:                goto step 6
12:            else
13:                f(v_i) ← F_1
14:            end if
15:        end for
16:        goto step 3 {no more functionality assignments left with the set V of cubes}
17:    end loop
18: end for
19: return (𝒮)
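The enumeration in Algorithm A.5 is the product of two standard combinatorial steps: choose n_f cubes, then assign each chosen cube one of the seven specific functionalities. A minimal sketch using the standard library follows; the string labels "F1" to "F8" and the function name are placeholders chosen here.

```python
from itertools import combinations, product

SPECIFIC = ("F1", "F2", "F3", "F4", "F5", "F6", "F7")
DEFAULT = "F8"


def functionalize_surface(shape_cubes, n_f):
    """Yield every functionality map with exactly n_f specifically typed cubes."""
    cubes = sorted(shape_cubes)
    for chosen in combinations(cubes, n_f):
        for labels in product(SPECIFIC, repeat=n_f):
            f = {cube: DEFAULT for cube in cubes}  # default functionality
            f.update(zip(chosen, labels))          # specific functionality
            yield f
```

A shape of volume v therefore gives C(v, n_f) * 7^n_f theoretical surfaces for a fixed n_f; covering "up to n_f" points, as described in the text, would mean repeating the call for each smaller count as well.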
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Quantization 



In the quantization process a molecule is reduced into a representation such that complementarity can be calculated against a set of theoretical target surfaces. Quantization takes place in the following steps:

1. Each atom in the molecule is assigned a functionality based on its type and connectivity.

2. Three dimensional conformations of the molecular structure are generated.

3. Each conformation is converted into "positive space" cubes based on the positions of its atoms.

4. Each cube is assigned a functionality based on the atoms that it contains.

Finally, the quantized form of the molecule is defined to be the set of all selected conformation quantizations with functionalities assigned. The quantization process is summarized in Figure 14.

B.1 Atomic Functionality Assignment

Given a molecular structure $\mathcal{M}$ as a set of atoms, a map $f : \mathcal{M} \to \mathcal{F}$ is defined assigning functionalities to all of the atoms. The map is defined by using a set of rules to match molecular substructures based on extended atom types (as generated by Tripos, for example) and bonding patterns that encapsulate each functionality type. The algorithm keeps track of atoms that are excluded from matching a lower priority rule because they have already been matched in a higher priority rule, where $\mathcal{F}_1$ has the highest priority and $\mathcal{F}_7$ the lowest. Atoms not matching any rule are assigned functionality $\mathcal{F}_7$. No atoms are assigned functionality $\mathcal{F}_8$. The functionality rules used can be seen as follows:

• $\mathcal{F}_1$: Figure 16

• $\mathcal{F}_2$: Figure 17

• $\mathcal{F}_3$: Figure 18

• $\mathcal{F}_4$: Figure 19

• $\mathcal{F}_5$: Figure 20

• $\mathcal{F}_6$: Figure 21

Within each functionality type, the functionality rules are searched sequentially in the order listed in Figures 16-21. Figure 15 provides a legend for the symbols used in the functionality rule diagrams.

Algorithm B.1 ATOMFUNCTIONALMAP(M): assign functionality to the atoms in molecular structure M and determine which atoms are excluded from quantization.

E_c ← ∅ {E_c is the set of atoms that are excluded from consideration}
E_q ← ∅ {E_q is the set of atoms that are excluded from the rest of the quantization process}
define f : M → F such that f(a) = F_7 for all atoms a ∈ M
for i ← 1 to 6 do
    for each functionality rule for F_i considered in the proper order do
        find all matches in M with consideration to E_c
        according to the rule, add appropriate atoms to E_c and E_q, and set f to F_i for selected atoms
    end for
end for
return (f, E_q)

B.2 Conformation Generation

A conformation is a mapping $c : \mathcal{M} \to \mathbb{R}^3$ of the molecular structure into three dimensional space. Using OpenEye Omega software (Open Eye Scientific Software Inc., 335c Winische Way, Santa Fe, NM, 87501), up to $n_c$ representative conformations for each molecule are generated within given rule-based energy parameters.
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APPENDIX B. QUANTIZATION 

B*3 Base Coordinate Frame 

A coordinate frame $T = (R, t) : \mathbb{R}^3 \to \mathbb{R}^3$ is a rigid motion of the space defined by a
rotation $R \in SO_3(\mathbb{R})$ and a translation $t \in \mathbb{R}^3$ that transforms a point p by the rule:

\[ T(p) = Rp + t \]

Given a resolution r (the length of a side of a lattice cube) and a coordinate frame, a
lattice on $\mathbb{R}^3$ is implicitly defined by

\[ q \in \mathbb{Z}^3 \mapsto [rq_x, rq_x + r) \times [rq_y, rq_y + r) \times [rq_z, rq_z + r) \subset \mathbb{R}^3 \]

The base coordinate frame is generated from a conformation of molecular structure
M. First, a subset A of the atoms in M is selected via FRAMEATOMS. These are
atoms in or near ring structures close to the center of the conformation. A ring atom
is defined to be an atom that contains at least one bond which, if removed, would not
result in the molecule being disconnected. If there are too few ring
atoms, all atoms sufficiently close to the center of the conformation are used.

Then, the base coordinate frame is calculated in BASEFRAME. The x-axis of the
base coordinate frame is defined to be the solution to the optimization problem:

\[ \max_{\|x\| = 1} \sum_{a \in A} \left| x^T (c(a) - p) \right| \]

where

\[ p = \frac{1}{|A|} \sum_{a \in A} c(a) \]

is the center of the conformation. Due to the non-linear nature of this problem, it is
solved using an approximate gradient descent method. Given x, the y-axis of the base
coordinate frame is defined to be the solution of a second optimization problem, the
same maximization restricted to unit vectors orthogonal to x:

\[ \max_{\|y\| = 1,\; y^T x = 0} \sum_{a \in A} \left| y^T (c(a) - p) \right| \]

Given x and y, z is uniquely specified. The base coordinate frame then simply involves
centering the conformation by translating p to the origin, and using the new x, y, and z
axes.
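The frame rule T(p) = Rp + t and the half-open lattice cubes defined above can be illustrated with a minimal sketch. The function names `apply_frame` and `lattice_cube` are hypothetical and are not part of the original text; the sketch only assumes that a cube index is obtained by taking the floor of each transformed coordinate divided by r, which is the convention CUBIFY uses.

```python
import math

def apply_frame(R, t, p):
    """Apply T(p) = R p + t, with R a 3x3 matrix given as a list of rows."""
    return tuple(sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3))

def lattice_cube(p, r):
    """Integer index q of the half-open cube [r*q, r*q + r) that contains p."""
    return tuple(math.floor(coord / r) for coord in p)

# Hypothetical example: identity rotation, small translation, r = 4.24 Angstroms.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
p = apply_frame(identity, (0.5, 0.0, -0.5), (4.0, 8.6, 0.2))
q = lattice_cube(p, 4.24)
```

Because the cubes are half-open, a point with a slightly negative coordinate falls into the cube with index -1 along that axis rather than into cube 0.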
Algorithm B.2 FRAMEATOMS(M, c, r_f, r_m, q_r): select a set of atoms to use for gen-
erating the coordinate frame from a molecular structure M and a conformation c. The
following parameters are used: ring factor r_f, ring minimum r_m, radius factor q_r.

1: ℛ ← ∅ {ℛ will hold the set of ring atoms}
2: for all a ∈ M do
3:   if a is not a Hydrogen atom and a is in a ring then
4:     add a to ℛ
5:   end if
6: end for
7: p ← (1/|M|) Σ_{a∈M} c(a) {the mathematical center of the conformation}
8: r ← max_{a∈M} ‖c(a) − p‖ {the maximum distance from the center of the conformation}
9: C ← ∅ {a placeholder for atoms in rings we have already examined}
10: ℛ_g ← ∅ {a set of ring groups which are close to the center of the conformation}
11: for all a ∈ ℛ do
12:   if ‖c(a) − p‖ < r q_r and a ∉ C then {if a ring atom is sufficiently close to the
      center of the conformation, add all members of its ring group}
13:     G ← RINGGROUP(a, ℛ, r_f)
14:     add G to ℛ_g
15:     C ← C ∪ G
16:   end if
17: end for
18: D ← {a ∈ M such that ‖c(a) − p‖ < r q_r} {a default set of atoms close to the
    center, to be used if we haven't collected enough ring atoms}
19: if ℛ_g = ∅ then {if there are no ring groups, use the default}
20:   return (D)
21: end if
22: if max_{G∈ℛ_g} |G| < r_m then {if the biggest ring group is smaller than r_m, use the
    default}
23:   return (D)
24: end if
25: B_g ← {G ∈ ℛ_g such that |G| ≥ 6} {a set of "big" ring groups, having at least 6 atoms}
26: A ← ∅
27: for all a ∈ ∪_{G∈B_g} G do {use all "big" ring group atoms and any neighboring
    atoms}
28:   add a and all atoms bonded to a to A, if not already present
29: end for
30: return (A)
Algorithm B.3 RINGGROUP(a, ℛ, r_f): calculate the set of atoms that can be reached
starting at atom a and crossing over at most r_f atoms that are not in the set of ring
atoms ℛ.

1: d ← 0
2: T_h ← {a}, T_n ← ∅
3: E ← {a}, G ← {a}
4: while d ≤ r_f do
5:   if T_h = ∅ then
6:     d ← d + 1
7:     T_h ← T_n, T_n ← ∅
8:   else
9:     a ← pop(T_h)
10:     if a is not a Hydrogen atom and a ∈ E then
11:       for all atoms o bonded to a do
12:         if o is not a Hydrogen atom and o ∉ E then
13:           add o to E
14:           if o ∈ ℛ then
15:             add o to G
16:             add o to T_h
17:           else
18:             add o to T_n
19:           end if
20:         end if
21:       end for
22:     end if
23:   end if
24: end while
25: return (G)
Algorithm B.4 BASEFRAME(A, c): compute the base coordinate frame given a set
of atoms A and a conformation c.

1: p ← (1/|A|) Σ_{a∈A} c(a)
2: define u : A → R^3 by u(a) = c(a) − p
3: d_1 ← (0, 0, 0)
4: for all a ∈ A do
5:   v ← u(a)/‖u(a)‖
6:   loop
7:     s ← Σ_{o∈A} sign(v^T u(o)) u(o)
8:     s ← s/‖s‖
9:     for all o ∈ A do
10:       if sign(s^T u(o)) ≠ 0 and sign(s^T u(o)) ≠ sign(v^T u(o)) then
11:         v ← s
12:         goto Step 6
13:       end if
14:     end for
15:     v ← s
16:     goto Step 18
17:   end loop
18:   if Σ_{o∈A} |v^T u(o)| > Σ_{o∈A} |d_1^T u(o)| then
19:     d_1 ← v
20:   end if
21: end for
22: d_2 ← (0, 0, 0)
23: for all a ∈ A do
24:   v ← d_1 × (u(a) − (d_1^T u(a)) d_1)
25:   v ← v/‖v‖
26:   s ← Σ_{o∈A} sign(v^T u(o)) u(o)
27:   s ← (s − (d_1^T s) d_1)/‖s − (d_1^T s) d_1‖
28:   if sign(s^T u(o)) = sign(v^T u(o)) for all o ∈ A such that sign(s^T u(o)) ≠ 0
      then
29:     if Σ_{o∈A} |s^T u(o)| > Σ_{o∈A} |d_2^T u(o)| then
30:       d_2 ← s
31:     end if
32:   end if
33: end for
34: R ← [d_1, d_2, d_1 × d_2]^T
35: return (R, −Rp)
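The first half of BASEFRAME (the search for the x-axis) can be sketched in Python as follows. This is an illustrative reading of steps 1 to 21 only, not the original implementation: the helper names (`refine_axis`, `best_x_axis`, `objective`) and the `max_iter` safety bound are assumptions added here. Each atom direction is used as a starting guess, the guess is replaced by the normalized sign-weighted sum of the centered positions until the sign pattern stops changing, and the candidate with the largest sum of absolute projections is kept. The result is a local, approximate maximizer.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sign(x):
    return (x > 0) - (x < 0)

def normalize(v):
    n = math.sqrt(dot(v, v))
    return tuple(x / n for x in v)

def objective(x, us):
    """Sum of absolute projections of the centered positions onto x."""
    return sum(abs(dot(x, u)) for u in us)

def refine_axis(v, us, max_iter=100):
    """Iterate s = normalize(sum_o sign(v.u(o)) u(o)) until the signs are stable."""
    for _ in range(max_iter):  # max_iter is a safety bound assumed for this sketch
        s = [0.0, 0.0, 0.0]
        for u in us:
            sg = sign(dot(v, u))
            s = [si + sg * ui for si, ui in zip(s, u)]
        if dot(s, s) == 0:
            return v
        s = normalize(s)
        stable = all(sign(dot(s, u)) in (0, sign(dot(v, u))) for u in us)
        v = s
        if stable:
            break
    return v

def best_x_axis(points):
    """Try every atom direction as a start and keep the best refined axis."""
    n = len(points)
    p = tuple(sum(c[i] for c in points) / n for i in range(3))
    us = [tuple(c[i] - p[i] for i in range(3)) for c in points]
    best = None
    for u in us:
        if dot(u, u) == 0:
            continue
        v = refine_axis(normalize(u), us)
        if best is None or objective(v, us) > objective(best, us):
            best = v
    return best

pts = [(-3.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
x_axis = best_x_axis(pts)
```

For this small planar example the selected axis lies along the direction in which the atoms are most spread out. The y-axis search of steps 22 to 33 would follow the same pattern with each candidate projected orthogonal to the chosen x-axis.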
B.4 Lattice Centering

By our definition above, the lattice defined by a coordinate frame places corners of
cubes at points all of whose coordinates are integer multiples of r. Given a particular
conformation, it may be better to shift the lattice by a length of r/2 in a particular
direction, recentering the lattice cubes.

Algorithm B.5 CENTERFRAME(T, M, c, c_f, c_t): recenter the coordinate frame T for
the set of atoms M and conformation c using parameters: centering fraction c_f, reso-
lution r, centering tolerance c_t.

1: l ← ⌈c_f |M|⌉
2: set x_l to be the l-th lowest element in the set {|T(c(a))_x|, a ∈ M}
3: set y_l to be the l-th lowest element in the set {|T(c(a))_y|, a ∈ M}
4: set z_l to be the l-th lowest element in the set {|T(c(a))_z|, a ∈ M}
5: A ← {a ∈ M such that |T(c(a))_x| < x_l, |T(c(a))_y| < y_l, |T(c(a))_z| < z_l}
6: define the offset vectors:

     v_1 = (r/2, r/2, r/2)
     v_2 = (0,   r/2, r/2)
     v_3 = (r/2, 0,   r/2)
     v_4 = (0,   0,   r/2)
     v_5 = (r/2, r/2, 0)
     v_6 = (0,   r/2, 0)
     v_7 = (r/2, 0,   0)
     v_8 = (0,   0,   0)

7: for i ← 1 to 8 do
8:   define the coordinate frame T_i(p) = p + v_i
9:   Q_i ← CUBIFY(A, c, T_i ∘ T, r, c_t)
10:   n_i ← |Q_i|
11:   d_{x,i} ← max_{q∈Q_i} q_x − min_{q∈Q_i} q_x
12:   d_{y,i} ← max_{q∈Q_i} q_y − min_{q∈Q_i} q_y
13:   d_{z,i} ← max_{q∈Q_i} q_z − min_{q∈Q_i} q_z
14: end for
15: define j to be the index of the least member of the set {(n_i, d_{x,i}, d_{y,i}, d_{z,i})}, where
    comparisons are done using a dictionary ordering (that is, compare the first com-
    ponent, in case of equality compare the second component, etc.)
16: return (T_j)
B.5 Lattice Perturbations

The base coordinate frame is not necessarily optimal for quantization, so a set of
"close" frames is also examined. By parameterizing the space of coordinate frames
$SO_3(\mathbb{R}) \times \mathbb{R}^3$ into 3 rotational dimensions and 3 translational dimensions and taking
$n_r$ equal steps in each rotational dimension and $n_t$ equal steps in each translational di-
mension, a set of $n_r^3 n_t^3$ perturbation frames is built. These can later be composed with
the base coordinate frame to give the desired set.
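The size of the perturbation set can be checked with a short sketch. The step formulas follow PERTURBATIONFRAMES: rotation angles of π v_r (2i + 1 − n_r)/(4 n_r) radians, and translation offsets assumed here to be r v_t (2j + 1 − n_t)/(2 n_t), that is, centered on zero using the translation count. The function names are hypothetical, and the sketch only enumerates the six parameters of each frame rather than building rotation matrices.

```python
import math
from itertools import product

def centered_steps(n, half_range):
    """n equally spaced values symmetric about zero: half_range * (2i + 1 - n) / n."""
    return [half_range * (2 * i + 1 - n) / n for i in range(n)]

def perturbation_parameters(r, n_t, v_t, n_r, v_r):
    """All (angle_x, angle_y, angle_z, shift_x, shift_y, shift_z) combinations."""
    angles = centered_steps(n_r, math.pi * v_r / 4)
    shifts = centered_steps(n_t, r * v_t / 2)
    return list(product(angles, angles, angles, shifts, shifts, shifts))

# Values taken from the parameter list in Appendix H.
params = perturbation_parameters(4.24, 5, 0.2, 5, 0.1)
```

With n_r = n_t = 5 this yields 5^6 = 15625 candidate frames per conformation, and the middle step in every dimension is the unperturbed value 0.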



Algorithm B.6 PERTURBATIONFRAMES(n_t, v_t, n_r, v_r): generate a set of perturbation
coordinate frames using parameters: resolution r, number of translations n_t, transla-
tional variance v_t, number of rotations n_r, rotational variance v_r.

1: 𝒯 ← ∅
2: for i_x ← 0 to n_r − 1 do
3:   define R_x to be the coordinate frame corresponding to rotation about the x-axis
     by π v_r (2i_x + 1 − n_r)/(4n_r) radians
4:   for i_y ← 0 to n_r − 1 do
5:     define R_y to be the coordinate frame corresponding to rotation about the y-
       axis by π v_r (2i_y + 1 − n_r)/(4n_r) radians
6:     for i_z ← 0 to n_r − 1 do
7:       define R_z to be the coordinate frame corresponding to rotation about the
         z-axis by π v_r (2i_z + 1 − n_r)/(4n_r) radians
8:       for j_x ← 0 to n_t − 1 do
9:         define the coordinate frame
           T_x(p) = p + (r v_t (2j_x + 1 − n_t)/(2n_t), 0, 0)
10:         for j_y ← 0 to n_t − 1 do
11:           define the coordinate frame
              T_y(p) = p + (0, r v_t (2j_y + 1 − n_t)/(2n_t), 0)
12:           for j_z ← 0 to n_t − 1 do
13:             define the coordinate frame
                T_z(p) = p + (0, 0, r v_t (2j_z + 1 − n_t)/(2n_t))
14:             add T_z ∘ T_y ∘ T_x ∘ R_z ∘ R_y ∘ R_x to 𝒯
15:           end for
16:         end for
17:       end for
18:     end for
19:   end for
20: end for
21: return (𝒯)
B.6 Cubification

Given a conformation and a lattice (defined by a coordinate frame and resolution),
cubification is the process of determining which lattice cubes are filled by the confor-
mation. The set of lattice cubes is constructed by taking any cube in which an atom
center in the conformation directly falls and also cubes which are sufficiently close to
the van der Waals sphere of an atom.



Algorithm B.7 CUBIFY(M, c, T, r, t): quantize the conformation c of the molecular
structure M into a set of cubes defined on the coordinate frame T using parameters:
coordinate frame T, resolution r, tolerance t.

1: Q ← ∅ {Q will contain the set of cubes in the quantization}
2: define a function d : M → R that takes each atom to its van der Waals radius
3: for all a ∈ M do
4:   p ← T(c(a)) {the transformed position}
5:   q ← (⌊p_x/r⌋, ⌊p_y/r⌋, ⌊p_z/r⌋) {the integer coordinates of the cube into which
     the atom falls}
6:   if q ∉ Q then
7:     add q to Q
8:   end if
9:   define N to be the set of cubes neighboring q:
     N ← {q' ∈ Z^3 such that max(|q'_x − q_x|, |q'_y − q_y|, |q'_z − q_z|) = 1}
10:   for all q' ∈ N do
11:     define δ to be the distance of the point in R^3 in the cube defined by q' that is
        closest to p:
        δ ← min ‖p − v‖ over v ∈ [rq'_x, rq'_x + r) × [rq'_y, rq'_y + r) × [rq'_z, rq'_z + r)
12:     if δ − d(a) < t r and q' ∉ Q then
13:       add q' to Q {add a neighboring cube if it is too close to the van der Waals
          sphere of the atom}
14:     end if
15:   end for
16: end for
17: return (Q)
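A compact Python sketch of the cubification step is given below. It assumes the atom positions have already been transformed by the coordinate frame, and it represents each atom as a hypothetical pair of a position and a van der Waals radius. The closest point of a neighboring cube is found by clamping each coordinate to the cube's closed extent, which gives the same distance as the minimum over the half-open cube up to the boundary.

```python
import math
from itertools import product

def cubify(atoms, r, t):
    """atoms: list of ((x, y, z), vdw_radius), positions already in frame coordinates.

    Returns the set of integer cube indices filled by the conformation.
    """
    cubes = set()
    for p, radius in atoms:
        q = tuple(math.floor(x / r) for x in p)
        cubes.add(q)
        for off in product((-1, 0, 1), repeat=3):
            if off == (0, 0, 0):
                continue
            n = tuple(qi + oi for qi, oi in zip(q, off))
            # Closest point of cube n to p: clamp each coordinate to [r*n, r*n + r].
            closest = tuple(min(max(x, r * ni), r * ni + r) for x, ni in zip(p, n))
            dist = math.dist(p, closest)
            if dist - radius < t * r:
                cubes.add(n)
    return cubes

# A single atom at the center of cube (0, 0, 0), with r = 4.24 and t = 0.32.
small = cubify([((2.12, 2.12, 2.12), 0.5)], 4.24, 0.32)
large = cubify([((2.12, 2.12, 2.12), 1.0)], 4.24, 0.32)
```

With the smaller radius only the cube containing the atom center is filled; with the larger radius the six face-adjacent cubes fall within the tolerance and are added as well.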
B.7 Entire Process

Given a set of conformations, QUANTIZE gives the entire quantization algorithm.
For each conformation:

1. a base frame and centering frame are calculated

2. perturbations of the base frame are used in order to find the lattice that results
first in the smallest number of cubes and second in the least distance from the
base frame

3. using the atomic functionality map, functionalities are assigned to the cubes in
the minimal quantization
Algorithm B.8 QUANTIZE(M, C, r, t, r_f, r_m, q_r, c_f, c_t, n_t, v_t, n_r, v_r): quantize the
set of conformations C for the molecular structure M with parameters: resolution r,
tolerance t, ring factor r_f, ring minimum r_m, radius factor q_r, centering fraction c_f,
centering tolerance c_t, number of translations n_t, translational variance v_t, number of
rotations n_r, rotational variance v_r, polarizable minimum p_m.

1: (f, E_q) ← ATOMFUNCTIONMAP(M) {build a functionality map and a list of
   atoms to be excluded from quantization}
2: 𝒮 ← ∅ {will hold the resulting quantizations}
3: for all c ∈ C do
4:   A ← FRAMEATOMS(M − E_q, c, r_f, r_m, q_r)
5:   T_b ← BASEFRAME(A, c) {base coordinate frame}
6:   T_c ← CENTERFRAME(T_b, M − E_q, c, c_f, c_t) {lattice centering}
7:   q_m ← ∞ {smallest number of cubes seen so far}
8:   d_m ← ∞ {smallest transform distance seen so far}
9:   𝒯 ← PERTURBATIONFRAMES(n_t, v_t, n_r, v_r) {set of perturbation frames}
10:   for all T_p ∈ 𝒯 do
11:     T ← T_c ∘ T_p ∘ T_b {the total coordinate transform}
12:     Q ← CUBIFY(M − E_q, c, T, r, t) {cubes given this transform}
13:     q ← |Q| {number of cubes}
14:     d ← Σ_{a∈M−E_q} ‖T_p(c(a)) − c(a)‖ {transform distance}
15:     if q < q_m or (q = q_m and d < d_m) then
16:       q_m ← q, d_m ← d, Q_m ← Q, T_m ← T
17:     end if
18:   end for
19:   define f_m : Q_m → F as f_m(q) = F_7
20:   for all a ∈ M − E_q do
21:     q ← (⌊T_m(c(a))_x/r⌋, ⌊T_m(c(a))_y/r⌋, ⌊T_m(c(a))_z/r⌋)
        {q is the cube a falls into}
22:     if f(a) has higher priority than f_m(q) then
23:       if f(a) ≠ F_6 then
24:         f_m(q) ← f(a)
25:       else if at least p_m atoms with functionality F_6 are in q then
26:         f_m(q) ← f(a)
27:       end if
28:     end if
29:   end for
30:   add (Q_m, f_m) to 𝒮
31: end for
32: return (𝒮)
Appendix C

Surface Complementarity

In order to view molecules in the space of theoretical surfaces, we must establish com-
plementarity between quantized conformations and theoretical surfaces. This is done
in FITSURFACES as follows:

1. all 24 possible orientations of each quantized conformation are considered

2. for each orientation, the quantized conformation is shifted down and below the
plane

3. a set of surfaces which fit each shifted conformation is detected

4. functionalities for each surface are computed such that the binding energy be-
tween the surface and the quantized conformation is favorable
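The 24 orientations in step 1 are the proper rotations that map the cubic lattice onto itself. A minimal sketch, assuming they are represented as signed 3x3 permutation matrices with determinant +1 (the function names are hypothetical):

```python
from itertools import permutations, product

def det3(m):
    """Determinant of a 3x3 integer matrix given as nested sequences."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def lattice_rotations():
    """All signed permutation matrices with determinant +1 (24 in total)."""
    rotations = []
    for perm in permutations(range(3)):
        for signs in product((1, -1), repeat=3):
            m = [[signs[i] if perm[i] == j else 0 for j in range(3)] for i in range(3)]
            if det3(m) == 1:
                rotations.append(tuple(map(tuple, m)))
    return rotations

def rotate(m, q):
    """Apply rotation matrix m to an integer cube index q."""
    return tuple(sum(m[i][j] * q[j] for j in range(3)) for i in range(3))

rotations = lattice_rotations()
images = {rotate(m, (1, 2, 3)) for m in rotations}
```

There are 48 signed permutation matrices in total; discarding the 24 reflections (determinant −1) leaves exactly the orientations that a rigid quantized conformation can take on the lattice.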



Algorithm C.1 FITSURFACES(Q, f_q, E_c, r_b, ...): calculate all surfaces with func-
tionality that are complementary to the quantized conformation Q with function-
ality map f_q, conformational energy E_c, and r_b rotatable bonds using the following
parameters: maximum surface opening area A, maximum surface volume V, area-
threshold A_t, max-non-central M_nc, max-contiguous M_c, max-extrusion M_e, num-
ber of points of characteristic functionality n_f, minimum energy E_min, minimum fit
quanta q_min, minimum slackness s_min, maximum slackness s_max, maximum protru-
sion levels p_max, translational-rotational-vibrational entropy E_trv, rotatable bond co-
efficient c_r, hydrophobic energy coefficient c_h, hydrophobic surface energy coefficient
c_s, potential function P.

1: 𝒮 ← ∅ {will hold the resulting surfaces}
2: define ℛ to be the set of 24 lattice rotations
3: for all R ∈ ℛ do
4:   Q' ← {Rq, q ∈ Q} {rotate the quantization by R}
5:   t_x ← ⌊(max_{q∈Q'} q_x + min_{q∈Q'} q_x)/2⌋
6:   t_y ← ⌊(max_{q∈Q'} q_y + min_{q∈Q'} q_y)/2⌋
7:   t_z ← min_{q∈Q'} q_z
8:   Q' ← {(q_x − t_x, q_y − t_y, q_z − t_z), q ∈ Q'} {recenter over the x-y plane}
9:   for all d ← 0 to max_{q∈Q'} q_z do
10:     Q_d ← {(q_x, q_y, q_z − d), q ∈ Q' such that q_z ≤ d} {shift the quantization
        down and only keep cubes on or under the x-y plane}
11:     if |Q_d| < min(q_min, |Q|) − s_min then {skip if not enough cubes are below
        the plane}
12:       goto Step 9
13:     end if
14:     if max_{q∈Q_d} q_z > p_max then {skip if too many layers of cubes are sticking
        out of the plane}
15:       goto Step 9
16:     end if
17:     S_d ← CORESURFACE(Q_d)
18:     A_d ← min(s_max + |S_d|, A), V_d ← min(s_max + |S_d|, V)
19:     for all S ∈ DETECTSURFACES(S_d, A_d, V_d, A_t, M_nc, M_c, M_e) do
20:       for all (S_f, f_s) ∈ FUNCTIONALIZESURFACE(S, n_f) do
21:         if ENERGY(S_f, f_s, Q_d, f_q, E_c, r_b, r, E_trv, c_r, c_h, c_s, P) > E_min then
22:           if no translation or rotation of (S_f, f_s) is in 𝒮 then
23:             add (S_f, f_s) to 𝒮
24:           end if
25:         end if
26:       end for
27:     end for
28:   end for
29: end for
30: return (𝒮)
Algorithm C.2 CORESURFACE(Q): calculate the minimal surface fitting a quantized
conformation.

1: O ← ∅ {surface opening}
2: for all q ∈ Q do
3:   add (q_x, q_y) to O, if not already present
4: end for
5: define a depth function d : O → N such that d(x, y) = 1
6: for all q ∈ Q do
7:   d(q_x, q_y) ← max(d(q_x, q_y), 1 − q_z)
8: end for
9: return (shape(O, d))
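CORESURFACE reduces to a single pass over the cubes. In the sketch below the surface is represented as a dictionary from opening squares (x, y) to depths; this dictionary stands in for shape(O, d) and is an assumption of the example rather than the original data structure.

```python
def core_surface(cubes):
    """Map each opening square (x, y) to its depth: at least 1, else max(1 - z)."""
    depth = {}
    for (x, y, z) in cubes:
        depth[(x, y)] = max(depth.get((x, y), 1), 1 - z)
    return depth

# Cubes on or under the x-y plane: a column three cubes deep next to a single cube.
surface = core_surface({(0, 0, 0), (0, 0, -1), (0, 0, -2), (1, 0, 0)})
volume = sum(surface.values())
```

A cube at z = 0 contributes depth 1, so the depth of a column is one more than the distance of its lowest cube below the plane, and the volume of the core surface is the sum of the column depths.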



Algorithm C.3 DETECTSURFACES(S_c, A, V, A_t, M_nc, M_c, M_e): detect additional
surfaces by adding cubes to S_c subject to parameters: maximum surface opening
area A, maximum surface volume V, area-threshold A_t, max-non-central M_nc, max-
contiguous M_c, max-extrusion M_e.

1: 𝒮 ← ∅ {the detected surfaces}
2: 𝒮_n ← {S_c}
3: while 𝒮_n ≠ ∅ do
4:   S_m ← pop(𝒮_n), O ← opening(S_m), d_m ← depth(S_m)
5:   if O is connected and area(O) < A then {add surfaces obtained by adding
     cubes below S_m}
6:     d ← d_m, v ← volume(S_m)
7:     while v < V do
8:       S ← shape(O, d)
9:       add S to 𝒮
10:       for x ← −area(O) to area(O), y ← −area(O) to area(O) do
11:         if (x, y) ∈ O and v + 1 < V then
12:           d(x, y) ← d(x, y) + 1, v ← v + 1
13:           goto Step 7
14:         else
15:           v ← v − d(x, y) + d_m(x, y), d(x, y) ← d_m(x, y)
16:         end if
17:       end for
18:       goto Step 20 {no more surfaces left}
19:     end while
20:   end if
21:   if area(O) < A and volume(S_m) < V then {consider surfaces made by en-
      larging the opening}
22:     define 𝒫 to be the set of all possible openings obtained by adding a single
        square to O adjacent to a square already present in O
23:     for all P ∈ 𝒫 do
24:       define d_p : P → N such that d_p(x, y) = d_m(x, y) for (x, y) ∈ O and
          d_p(x, y) = 1 otherwise
25:       add shape(P, d_p) to 𝒮_n
26:     end for
27:   end if
28: end while
29: 𝒪 ← OPENINGFILTER({opening(S), S ∈ 𝒮}, A_t, M_nc, M_c)
30: for all S ∈ 𝒮 do {filter surfaces based on openings}
31:   if opening(S) ∉ 𝒪 then
32:     delete S from 𝒮
33:   end if
34: end for
35: 𝒮 ← SHAPEFILTER(𝒮, M_e) {filter surfaces based on shape}
36: return (𝒮)

C.1 Energy

Complementarity energy between a quantization of a molecular conformation and a
cubic theoretical surface is the sum of several components:

• translational-vibration-rotational entropy (a constant) 

• a conformational energy term, representing the energy of this conformation rela- 
tive to the minimal energy conformation (calculated by the conformational gen- 
erator) 

• a term proportional to the number of rotatable bonds 

• potential energy, the sum of energy due to interactions between the functionali-
ties of overlapping negative space surface cubes and positive space quantization
cubes (represented as a function $P : F \times F \to \mathbb{R} \cup \{-\infty\}$)

• hydrophobic energy due to the exclusion of water from cubes in the surface, 
proportional to the surface area from which water is excluded 



Algorithm C.4 ENERGY(S, f_s, Q, f_q, E_c, r_b, r, E_trv, c_r, c_h, c_s, P): calculate the com-
plementarity energy between a surface S with functionality map f_s and a quantized
conformation Q with functionality map f_q, conformational energy E_c, and r_b rotatable
bonds using parameters: resolution r, translational-rotational-vibrational entropy E_trv,
rotatable bond coefficient c_r, hydrophobic energy coefficient c_h, hydrophobic surface
energy coefficient c_s, potential function P.

1: E_r ← c_r r_b {energy due to rotatable bonds}
2: E_p ← 0 {energy due to potential interactions}
3: for all s ∈ S do
4:   if s ∈ Q then
5:     E_p ← E_p + P(f_q(s), f_s(s))
6:   end if
7: end for
8: E_h ← 0 {energy due to hydrophobicity}
9: for all s ∈ S ∩ Q do
10:   define
      T ← {(s_x + 1, s_y, s_z), (s_x − 1, s_y, s_z),
           (s_x, s_y, s_z − 1)}
11:   a ← r² |T ∩ S| {the area of contact between the surface and the quantized con-
      formation at this point}
12:   if f_s(s) = F_8 then {hydrophobic energy due to slightly hydrophobic surface
      cubes}
13:     E_h ← E_h + c_h c_s a
14:   else if f_s(s) ∈ {F_6, F_7} then {hydrophobic energy due to non-polar surface
      cubes}
15:     E_h ← E_h + c_h a
16:   end if
17:   if f_q(s) ∈ {F_6, F_7} then {hydrophobic energy due to non-polar quantized con-
      formation cubes}
18:     E_h ← E_h + c_h a
19:   end if
20: end for
21: return (E_trv + E_r + E_p + E_h − E_c)
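The final step of ENERGY is a plain sum of the five components. The sketch below shows only that last combination, with the potential and hydrophobic terms passed in as precomputed numbers; the function name and the sample potential, hydrophobic, and conformational values are hypothetical, while the entropy and rotatable bond coefficients are the ones listed in Appendix H.

```python
def total_energy(e_trv, c_r, rotatable_bonds, e_potential, e_hydrophobic, e_conf):
    """Combine the components as E_trv + E_r + E_p + E_h - E_c, with E_r = c_r * r_b."""
    e_r = c_r * rotatable_bonds
    return e_trv + e_r + e_potential + e_hydrophobic - e_conf

# E_trv = -9.0 kCal and c_r = -0.7 kCal/bond as in Appendix H; the rest is made up.
energy = total_energy(-9.0, -0.7, 2, 20.0, 1.0, 0.5)
accepted = energy > 8.0  # compared against the minimum energy E_min = 8.0 kCal
```

In this sign convention larger values are more favorable, which is why FITSURFACES keeps a functionalized surface only when the energy exceeds E_min.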
Appendix D

Molecular Library Comparison

A library is a set of molecular structures. Given a library, the set of complementary
theoretical surfaces is defined as the union of all surface shape/functionality pairs com-
plementary to any quantized conformation of any molecule in the library.

The algorithm LIBRARYCOMPARE calculates a score proportional to the similar-
ity of the two libraries. The score is calculated by representing each library as its set
of complementary theoretical surfaces, and using the SIMILARITYSCORE primitive to
determine the similarity or dissimilarity of two sets of theoretical surfaces. If the molec-
ular libraries each contain only one molecule, then the algorithm calculates a score
proportional to the similarity of the two molecules.

Algorithm D.1 LIBRARYCOMPARE(L_1, L_2): calculate a similarity score between 0
and 1000 for two molecular libraries.

1: define T_1 to be the theoretical surfaces complementary to L_1
2: define T_2 to be the theoretical surfaces complementary to L_2
3: return (SIMILARITYSCORE(T_1, T_2))



Algorithm D.2 SIMILARITYSCORE(T_1, T_2): calculate a similarity score between 0
and 1000 for two sets of theoretical surfaces.

1: define sets of complementary surface shapes:
     𝒮_1 ← {S such that for some f, (S, f) ∈ T_1}
     𝒮_2 ← {S such that for some f, (S, f) ∈ T_2}
   For purposes of comparing two theoretical surfaces or surface shapes, note that
   translations and x-y plane rotations of a surface are considered identical to the
   original surface.
2: s_s ← |𝒮_1 ∩ 𝒮_2| / |𝒮_1 ∪ 𝒮_2| {shape score}
3: s_f ← 0 {functionality score}
4: for all S ∈ 𝒮_1 ∩ 𝒮_2 do
5:   define complementary functionalities for this surface shape:
       ℱ_1 ← {f such that (S, f) ∈ T_1}
       ℱ_2 ← {f such that (S, f) ∈ T_2}
6:   define sets of "active" cubes, that is cubes with non-default functionality:
       𝒬_1 ← {Q ⊂ Z^3 such that ∃f ∈ ℱ_1, ∀q ∈ Q, f(q) ≠ F_8}
       𝒬_2 ← {Q ⊂ Z^3 such that ∃f ∈ ℱ_2, ∀q ∈ Q, f(q) ≠ F_8}
       𝒬_i ← {Q ⊂ Z^3 such that ∃f ∈ ℱ_1 ∩ ℱ_2, ∀q ∈ Q, f(q) ≠ F_8}
7:   s_f ← s_f + |𝒬_i| / (|𝒮_1 ∩ 𝒮_2| min(|𝒬_1|, |𝒬_2|))
8: end for
9: return (1000 s_s s_f)
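The shape score in step 2 is the ratio of shared shapes to all shapes, a Jaccard index. A minimal sketch of that step alone follows; it assumes each theoretical surface is a (shape, functionality) pair whose shape has already been reduced to a canonical, hashable form, so that translations and x-y plane rotations compare equal. The function name and the string labels are placeholders.

```python
def shape_score(t1, t2):
    """|S1 ∩ S2| / |S1 ∪ S2| over the shapes appearing in two surface sets."""
    s1 = {shape for shape, _ in t1}
    s2 = {shape for shape, _ in t2}
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

# Placeholder surfaces: shapes "A", "B", "C" with arbitrary functionality labels.
t1 = {("A", "f1"), ("B", "f2")}
t2 = {("B", "f3"), ("C", "f4")}
score = shape_score(t1, t2)
```

The full similarity score multiplies this value by the functionality score and by 1000, so two sets that share no shapes score 0 regardless of functionality.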
Appendix E



Protein Quantization 



In a process complementary to the quantization of a molecular conformation, the target 
sites of a protein surface are quantized into the same negative space cubic representa- 
tion used by theoretical surfaces. This allows the following analyses: 

• comparison between a set of known protein target sites and the set of all possible 
theoretical surfaces within given parameters of volume, shape, and functionality 

• comparison between two different sets of known protein target sites. 

• comparison between a set of known protein target sites and a set of theoretical 
surfaces to which a given set of molecules is complementary 

The protein quantization process is accomplished in the following steps, as depicted 
in Figure 30: 

1. A 3D crystal structure of the protein is examined and a functionality map is built
for the protein atoms using the algorithm ATOMFUNCTIONMAP.

2. A protein surface is generated from the 3D structure. A protein surface is a set of
triangles defining the surface of the protein that is accessible to water molecules
(known as the Connolly surface). Michael Connolly's MSRoll software is an
example of a package that can generate a protein surface suitable for this purpose.

3. Subsets of the surface which are likely target sites for the binding of small
molecules are detected. This can be accomplished, for example, by looking for
highly concave regions. Michael Connolly's MSForm software is an example of
a package that can measure surface curvature and detect pockets suitable for this
purpose.

4. Each target site is isolated and examined individually.

5. Each target site is quantized into a set of negative space cubes with associated
functionalities using the protein function map and the algorithm TARGETSITE-
QUANTIZE. The underlying process is very similar to the algorithm QUANTIZE,
using an imaginary molecule that is defined by atoms centered at a set of points
with a given radius that fill the target site.

6. Each set of quantized negative space cubes with functionality is converted to
a set of theoretical surfaces satisfying the proper constraints (for example, no
occluded cubes are allowed) using the algorithm BUILDSURFACES.



Algorithm E.1 TARGETSITEQUANTIZE(T, f, v, r, r_l, r_v, b, n_p, s_r, ...): calculate a
negative space cubic representation of the target site defined by the triangles in set T,
with v as a normal vector pointing out of the target site, f as a functionality map for the
entire protein, using parameters: resolution r, lattice density r_l, lattice van der Waals
radius r_v, lattice tolerance t_l, number of lattice neighbors n_p, buffer distance b, search
radius s_r, and additional parameters for subroutine calls (see below) as necessary.

1: define a coordinate frame T_l with origin at the center of the target site, z axis in the
   direction of v, x and y axes determined by the longest side of the pocket
2: define a set of points P ⊂ R^3 to be the points on a lattice with coordinate frame T_l
   and cube side length r_l such that p ∈ P iff p is contained in the target site and the
   closest triangle in T is at least distance b away from p
3: remove from P all points which do not have at least n_p neighboring points also in
   P, where each point has neighbors consisting of the 26 points on the lattice offset
   by at most one cube from the point in question
4: if P is disconnected, consider each connected component separately in the follow-
   ing steps
5: define a "molecular" structure M with conformation c by imagining P to be a set
   of atoms with van der Waals radius r_v
6: define a base coordinate frame T_b using the algorithm BASEFRAME on M and c
7: define a centering coordinate frame T_c using the algorithm CENTERFRAME on M
   and c
8: define a set of perturbation frames using PERTURBATIONFRAMES
9: as in QUANTIZE, by examining all perturbation frames, find the total frame which
   minimizes the number of negative space cubes of resolution r (calculated using
   CUBIFY with tolerance t_l) and the total transformation distance
10: denote the above set of negative space cubes by Q
11: redefine the coordinate system for the negative space cubes in Q, such that the z
    axis is the direction with the greatest component in the v direction, and the maxi-
    mum z value is 0
12: build a functionality map f_q : Q → F by, for each q ∈ Q, finding the highest
    priority functionality associated by f with an atom in the protein within search
    radius s_r to the center of q, or assigning F_8 if there are no such atoms
13: return (Q, f_q)
Algorithm E.2 BUILDSURFACES(Q, f_q, m_o, n_f, ...): return the set of theoretical sur-
faces satisfying proper constraints described by the negative space cubes Q with func-
tionality f_q, using parameters: maximum occlusions m_o, number of functionality
points n_f, additional parameters for subroutine calls (see below) as necessary.

1: define O ← ∅ and d : Z² → Z such that d(x, y) = 0 {future surface opening and
   depths}
2: for all q ∈ Q do {build an opening and depth function}
3:   if (q_x, q_y) ∉ O then
4:     add (q_x, q_y) to O, d(q_x, q_y) ← max(d(q_x, q_y), 1 − q_z)
5:   end if
6: end for
7: o_c ← REMOVEOCCLUDEDCUBES(Q, O, d)
8: if o_c > m_o then
9:   return (∅) {too many occluded cubes}
10: end if
11: if O is disconnected then
12:   define 𝒪 to be the set of connected components, 𝒮 ← ∅
13:   for all O_c ∈ 𝒪 do {generate a set of surfaces for each connected component}
14:     Q_c ← {q ∈ Q such that (q_x, q_y) ∈ O_c}
15:     𝒮 ← 𝒮 ∪ BUILDSURFACES(Q_c, f_q)
16:   end for
17:   return (𝒮)
18: end if
19: if O does not satisfy filtering rules imposed by OPENINGFILTER then
20:   return (∅)
21: end if
22: S ← shape(O, d)
23: if S does not satisfy filtering rules imposed by SHAPEFILTER then
24:   return (∅)
25: end if
26: 𝒮 ← ∅
27: Q_a ← {q ∈ Q such that f_q(q) ≠ F_8} {set of negative space cubes with non-
    default functionality}
28: for all Q' ⊂ Q_a such that |Q'| = n_f do
29:   define f_s : S → F such that f_s(x, y, z) = f_q(x, y, z) for (x, y, z) ∈ Q',
      f_s(x, y, z) = F_8 otherwise, add (S, f_s) to 𝒮
30: end for
31: return (𝒮)
Algorithm E.3 REMOVEOCCLUDEDCUBES(Q, O, d): remove from Q cubes which
are occluded, adjusting opening O and depths d, return the number of occluded cubes
removed.

1: o_c ← 0 {count of occluded cubes}
2: for all (x, y) ∈ O do
3:   for z ← −1 to 1 − d(x, y) do
4:     if (x, y, z) ∈ Q and (x, y, z + 1) ∉ Q then
5:       o_c ← o_c + 1
6:       remove (x, y, z) from Q
7:     end if
8:   end for
9:   Z_xy ← {q_z, (x, y, q_z) ∈ Q}
10:   if Z_xy = ∅ then
11:     delete (x, y) from O, d(x, y) ← 0
12:   else
13:     d(x, y) ← min{1 − q_z, q_z ∈ Z_xy}
14:   end if
15: end for
16: return (o_c)
Appendix F

Protein Surface Comparison

A target surface set is defined as the set of all theoretical target surfaces to which a
set of known protein surfaces map. The target surface set may comprise, for example,
all of the surfaces mapped from one protein, all of the surfaces mapped from multiple
proteins, or all of the surfaces mapped from specific sites on multiple proteins.

The algorithm PROTEINCOMPARE calculates a score proportional to the similarity
of the two sets of protein surfaces. The score is calculated by representing each protein
surface set as its target surface set, and using the SIMILARITYSCORE primitive to
determine the similarity or dissimilarity of two sets of theoretical surfaces.

Algorithm F.1 PROTEINCOMPARE(P_1, P_2): calculate a similarity score between 0
and 1000 for two protein surface sets.

1: define T_1 to be the target surface set corresponding to P_1
2: define T_2 to be the target surface set corresponding to P_2
3: return (SIMILARITYSCORE(T_1, T_2))
Appendix G

Protein/Library Comparison

The algorithm PROTEINLIBRARYCOMPARE calculates a score proportional to the com-
plementarity of a library of small molecules and a set of protein surfaces. The score is
calculated by representing the protein surface set as the theoretical surface set to which
it is similar, the molecular library as the theoretical surface set to which it is comple-
mentary, and using the SIMILARITYSCORE primitive to determine the similarity or
dissimilarity of two sets of theoretical surfaces.

Algorithm G.1 PROTEINLIBRARYCOMPARE(L, P): calculate a complementarity
score between 0 and 1000 for a molecular library L and a protein surface set P.

1: define T_l to be the theoretical surfaces complementary to L
2: define T_p to be the target surface set corresponding to P
3: return (SIMILARITYSCORE(T_l, T_p))
Appendix H

Parameter Values

The values of parameters used are as follows:

• Maximum opening area (A): 15

• Maximum surface volume (V): 18

• Area threshold (A_t): 8

• Maximum non-central opening squares (M_nc): 5

• Maximum contiguous opening squares (M_c): 3

• Maximum surface extrusions (M_e): 3

• Number of surface cubes of specific functionality (n_f): 4

• Maximum number of conformations per molecule: 300

• Resolution (r): 4.24 Angstroms

• Tolerance (t): 0.32

• Ring factor (r_f): 0

• Ring minimum (r_m): 13

• Radius factor (q_r): 0.75

• Centering tolerance (c_t): 0.1

• Centering fraction (c_f): 0.75

• Number of translations (n_t): 5

• Translational variance (v_t): 0.2

• Number of rotations (n_r): 5

• Rotational variance (v_r): 0.1

• Polarizable minimum (p_m): 2

• Minimum fit quanta (q_min): 9

• Minimum slackness (s_min): 2

• Maximum slackness (s_max): 0

• Maximum protrusion levels (p_max): 1

• Minimum energy (E_min): 8.0 kCal

• Translational-rotational-vibrational entropy (E_trv): −9.0 kCal

• Rotatable bond coefficient (c_r): −0.7 kCal/bond

• Hydrophobic energy coefficient (c_h): 0.025 kCal/Angstrom²

• Hydrophobic surface energy coefficient (c_s): 0.8

• Potential function (P): kCal

                                  Surface
Molecule    F_1    F_2    F_3    F_4    F_5    F_6    F_7    F_8
F_1         −∞     4.0    −∞     2.0    −∞     −∞     −∞     −∞
F_2         4.0    −∞     −∞     −∞     2.0    0.0    −∞     −∞
F_3         0.0    0.0    −∞     2.5    2.5    −∞     −∞     0.0
F_4         2.0    −∞     −∞     −∞     3.5    −∞     −∞     −1.0
F_5         −∞     2.0    −∞     3.5    −∞     −∞     −∞     −0.5
F_6         −∞     0.0    −∞     −∞     −∞     2.5    0.0    0.3
F_7         −∞     −∞     −∞     −∞     −∞     0.0    0.5    0.2


• Lattice density (r_l): 1.5 Angstroms

• Lattice van der Waals radius (r_v): 0.75 Angstroms

• Lattice tolerance (t_l): 0.2

• Buffer distance (b): 0.5 Angstroms

• Number of lattice neighbors (n_p): 3

• Search radius (s_r): 2.22 Angstroms

• Maximum number of occluded cubes (m_o): 1

CLAIMS

1. A computer-based method comprising

defining a set of constraints on possible target surfaces,

defining a fully enumerated set of theoretical target
surfaces under the defined constraints, such that each surface has a
defined, continuous volume and a defined, continuous surface
area,

mapping one or more sets of objects to the fully
enumerated set of theoretical target surfaces to define
corresponding subsets of the fully enumerated set of theoretical
target surfaces, and

analyzing an aspect of diversity of the objects based on
degrees of similarities and differences among the corresponding
subsets.
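
As an illustration of the final analyzing step only, here is a minimal sketch. It assumes each object has already been mapped to a set of integer identifiers of theoretical target surfaces, and it uses the Jaccard index as the degree of similarity; the claim does not prescribe a particular measure, so that choice is an assumption. The names `jaccard`, `pairwise_similarity` and the sample data are hypothetical.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard index of two sets of surface identifiers (1.0 if both are empty)."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)


def pairwise_similarity(subsets):
    """Map each unordered pair of object names to the similarity of their subsets."""
    names = sorted(subsets)
    return {
        (x, y): jaccard(subsets[x], subsets[y])
        for x, y in combinations(names, 2)
    }


# Hypothetical mapping: object name -> identifiers of the surfaces it maps to.
subsets = {
    "mol_a": {0, 1, 2},
    "mol_b": {1, 2, 3},
    "mol_c": {7},
}
similarity = pairwise_similarity(subsets)
```

In this sketch, objects whose subsets overlap heavily score near 1 and objects with disjoint subsets score 0, so a low average pairwise score would indicate a more diverse set.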

2. The method of claim 1 in which the target surfaces 
comprise negative space target surfaces. 

3. The method of claim 1 in which the objects comprise
positive space object surfaces associated with different molecules.

4. The method of claim 2 in which the objects comprise
positive space object surfaces associated with different molecules
and in which the objects are mapped by

defining corresponding subsets of the fully enumerated set
of negative space theoretical target surfaces to which positive
space object surfaces of conformations of molecules are
complementary, and

the aspect of diversity that is analyzed is the difference or
similarity between the molecules which map to those negative
space theoretical target surfaces.

5. The method of claim 1 in which the objects comprise
negative space object surfaces associated with different proteins.

6. The method of claim 2 in which the objects comprise
negative space object surfaces associated with different proteins
and in which the objects are mapped by

defining corresponding subsets of the fully enumerated set
of negative space theoretical target surfaces to which negative
space object surfaces of protein pockets are similar, and the aspect
of diversity that is analyzed is the difference or similarity between
protein pockets which map to those negative space theoretical
target surfaces.

7. The method of claim 1 in which the objects comprise
positive space object surfaces associated with different molecules
and negative space object surfaces associated with different
proteins.

8. The method of claim 2 in which the objects comprise positive space
object surfaces associated with different molecules and negative space object
surfaces associated with different proteins and in which,

in the case of molecules, the objects are mapped by defining
corresponding subsets of the fully enumerated set of negative space theoretical
target surfaces to which positive space object surfaces of conformations of
molecules are complementary,

in the case of proteins, the objects are mapped by defining corresponding
subsets of the fully enumerated set of negative space theoretical target surfaces to
which negative space object surfaces of protein pockets are similar, and

the aspect of diversity that is analyzed is the difference or
similarity of the molecules which map to those negative space
theoretical target surfaces to the protein pockets which map to
those negative space theoretical target surfaces.

9. The method of claim 1 in which the theoretical target surfaces comprise
polyhedrons.

10. The method of claim 1 in which the objects comprise polyhedrons. 

11. The method of claim 9 or 10 in which the polyhedrons comprise cubes.

12. The method of claim 9 or 10 in which the polyhedrons are all of the same
size and shape.

13. The method of claim 1 in which the set of all theoretical target surfaces
defines a diversity space within which the diversity of objects can be measured
by mapping those objects to the diversity space.

14. The method of claim 13 also including identifying regions of the diversity 
space to which no objects map. 

15. The method of claim 14 also including designing molecules that occupy at 
least one of the unfilled theoretical target surfaces of the diversity space. 
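
To illustrate the idea of regions of the diversity space to which no objects map, here is a minimal sketch. It assumes the theoretical target surfaces are numbered 0 to `n_surfaces - 1` and that each object has already been mapped to a set of those numbers; both the numbering and the names `unfilled_surfaces` and `coverage_fraction` are hypothetical.

```python
def unfilled_surfaces(n_surfaces, subsets):
    """Return the sorted identifiers of surfaces to which no object maps."""
    covered = set()
    for surface_ids in subsets.values():
        covered |= surface_ids
    return sorted(set(range(n_surfaces)) - covered)


def coverage_fraction(n_surfaces, subsets):
    """Fraction of the enumerated surfaces reached by at least one object."""
    return 1.0 - len(unfilled_surfaces(n_surfaces, subsets)) / n_surfaces


# Hypothetical library of two molecules over a six-surface diversity space.
library = {"mol_a": {0, 1, 2}, "mol_b": {1, 2, 3}}
gaps = unfilled_surfaces(6, library)
```

The identifiers returned in `gaps` would be the candidates for which new molecules could be designed.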

16. The method of claim 4 or 8 in which complementarity is associated with
binding affinities of positive space object surfaces of conformations of molecules
to negative space theoretical target surfaces.

17. The method of claim 1 in which the constraints comprise volume.

18. The method of claim 1 in which the constraints comprise associating each
of a number of sites of the target surface with a preselected molecular property.

19. The method of claim 18 in which each of the preselected molecular
properties is drawn from a larger set of possible molecular properties.

20. The method of claim 18 in which the preselected molecular properties
include hydrophobic, polarizable, H-bond acceptor, H-bond donor, H-bond
donor/acceptor, potentially positively charged, and potentially negatively
charged.

21. The method of claim 18 in which fewer than all of the sites of the target
surface are each associated with a different one of the molecular properties and
all of the other sites of the target surface are associated with a common molecular
property.

22. The method of claim 21 in which the common molecular property
comprises slightly hydrophobic.

23. The method of claim 1 in which the degrees of similarities or differences
comprise functional properties associated with the corresponding subsets of the
fully enumerated set of theoretical target surfaces.

24. The method of claim 1 in which the degrees of similarities or differences 
comprise shape properties associated with the corresponding subsets of the fully 
enumerated set of theoretical target surfaces. 

25. The method of claim 1 further comprising defining each of the objects by
quantizing molecules into polyhedrons.

26. The method of claim 1 also including fitting each of a fixed set of
orientations of each conformation of each of the objects to each of the target
surfaces.

27. The method of claim 26 further comprising scoring each of the fittings. 

28. The method of claim 9 in which the constraints comprise a resolution of
the polyhedrons. 

29. The method of claim 28 in which the resolution is 4.24 Angstroms. 
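
As a rough illustration of quantizing at a fixed resolution, the sketch below bins three-dimensional coordinates into cubes of edge 4.24 Angstroms. It is a simplification under stated assumptions: it bins point coordinates (for example atom centers) onto an axis-aligned grid anchored at the origin, which is not necessarily how the cubes are placed in the method described. The names `RESOLUTION` and `quantize` are hypothetical.

```python
import math

RESOLUTION = 4.24  # cube edge in Angstroms, taken from the claimed resolution


def quantize(coords, r=RESOLUTION):
    """Return the set of integer cube indices occupied by the given points.

    Assumption: each (x, y, z) point falls in the cube whose index is the
    floor of each coordinate divided by the cube edge r.
    """
    return {tuple(math.floor(c / r) for c in xyz) for xyz in coords}


# Hypothetical coordinates in Angstroms.
points = [(0.1, 0.2, 0.3), (4.0, 0.0, 0.0), (4.3, 0.0, -0.1)]
cubes = quantize(points)
```

Points that fall in the same cube collapse to one index, so the size of the returned set is the number of occupied cubes at this resolution.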

30. The method of claim 9 in which the constraints comprise maximum and
minimum numbers of polyhedrons. 

31. The method of claim 9 in which each of the polyhedrons shares a
common interface with another of the polyhedrons.

32. The method of claim 1 in which each of the target surfaces has no 
occlusions of volume greater than a given parameter. 

33. The method of claim 1 in which the target surfaces are defined
conceptually as having been carved out of a flat surface.

34. A method comprising

categorizing existing molecules based on negative space target surfaces to
which conformations of the molecules are complementary, and

designing novel molecules that are complementary to negative space
target surfaces to which no conformations of the existing molecules are
complementary.

35. A method of creating novel molecules to be tested as ligands for proteins,
comprising

categorizing proteins based on target surfaces to which their pockets of
known structure map, and

designing novel molecules that are complementary to the negative space
target surfaces to which the protein pockets map.

36. A computer programmed to determine the chemical similarity of different
molecules, the program comprising

approximating the surface shape of each one of a plurality of molecules of
interest by linking a series of cubes, each cube having a dimension R, the
locations of the cubes being determined by the calculated electron probability
density of the individual one of the molecules of interest, each cube sharing at
least one of its six faces with another cube, such that there is a specific number of
linked cubes which varies for each individual one of the plurality of molecules of
interest;

approximating the chemical reactivity of each individual one of the
plurality of molecules of interest by assigning each cube of each individual one of
the plurality of molecules of interest, no more than one functionality value from a
plurality of M different chemical functionality values;

approximating the surface shape and chemical reactivity of a chemically
active surface having a volume equal to V by subtracting a number V/R^3 cubes
of dimension R from a surface, wherein each of the cube spaces shares at least
one face with another cube space and wherein N of the cube spaces has one of a
plurality of M different chemical functionality values;

calculating an attraction value K for each one of the plurality of molecules
of interest to the chemically active surface; and

calculating a list of overall attraction values to the chemically active
surface.

37. The computer of claim 36, wherein further the calculation of the attraction
value K is performed on a plurality of different predetermined chemically active
surfaces, and a matrix of overall attractive values of each molecule of interest to
each of the different surfaces is calculated.

38. The computer of claim 36, wherein the plurality of molecules of interest
includes organic molecules.

39. The computer of claim 38, wherein further the chemically active surface 
having a plurality of predetermined active chemical locations is calculated to 
correspond to the shape of an actual protein surface structure. 

40. The computer of claim 36, wherein further the molecules of interest are
organic molecules of 1500 Daltons or less.

41. The computer of claim 36 wherein further the chemically active surface
having a plurality of predetermined active chemical locations is compared to an
actual protein surface to calculate a similarity value of the actual protein surface
to the predetermined active chemical locations.

42. The computer of claim 41 wherein further a plurality of predetermined
chemically active surfaces are compared to a plurality of actual protein surfaces
and a matrix of similarity values is calculated.

43. The computer of claim 42 wherein further the cube spaces subtracted
from the surface are calculated to approximate the electron probability density of
at least one of a plurality of depressions in known protein surface structures.

44. The computer of claim 42 wherein further the N sites of chemical
functionality are calculated to approximate the location and type of chemical
functionality of actual depressions in known protein structures.

Figure 4

Figure 5

Figure 6

Figure 9

Figure 13

Figure 15

Figure 17

Figure 20

Figure 22 Table 5: Ranking of molecules in Fig. 5 by QSCD diversity score. Blue =
homogeneous pairs, yellow = -Hphenyl pairs (8c), green = AT1-AT2 pairs (3,4)

Figure 23 Table 6: Ranking of molecules in Fig. 5 by Tanimoto similarity score of 2D UNITY
fingerprints. Blue = homogeneous pairs, yellow = -Hphenyl pairs (8c), green = AT1-AT2 pairs (3,4)

Figure 24

Example subset of theoretical surfaces Ti containing 4 members:

Example Central set Ci for Ti (F = 1, E = 3)

where black face denotes a point of attachment A1 on Ci:

Example Core Molecule Mi to fill Central set Ci:

Figure 26

Example Library L(Mi,B) where B = a set of amines:

Figure 27

Example subset of target surfaces Ti containing 4 members:

Example Central set Ci for Ti (F = 1, E = 3)

where black face denotes a point of attachment A1 on Ci:

Figure 28 

Example Core Molecule Mi to fill Central set Ci:

Figure 29

Example Library L(Mi,B) where B = a set of amines: